We present a new methodology for sufficient dimension reduction
(SDR). Our methodology derives directly from the formulation
of SDR in terms of the conditional independence of the covariate X 
from the response Y , given the projection of X on the central sub- 
space [cf. J. Amer. Statist. Assoc. 86 (1991) 316-342 and Regression 
Graphics (1998) Wiley]. We show that this conditional independence 
assertion can be characterized in terms of conditional covariance op- 
erators on reproducing kernel Hilbert spaces and we show how this 
characterization leads to an M-estimator for the central subspace. 
The resulting estimator is shown to be consistent under weak condi- 
tions; in particular, we do not have to impose linearity or ellipticity 
conditions of the kinds that are generally invoked for SDR methods. 
We also present empirical results showing that the new methodology 
is competitive in practice. 

1. Introduction. The problem of sufficient dimension reduction (SDR) 
for regression is that of finding a subspace S such that the projection of 
the covariate vector X onto S captures the statistical dependency of the 
response Y on X. More formally, let us characterize a dimension-reduction 
subspace S in terms of the following conditional independence assertion: 

(1) $Y \perp\!\!\!\perp X \mid \Pi_S X$,

where $\Pi_S X$ denotes the orthogonal projection of X onto S. It is possible
to show that under weak conditions the intersection of dimension-reduction 
subspaces is itself a dimension-reduction subspace, in which case the inter- 
section is referred to as a central subspace [5, 6]. As suggested in a seminal 
paper by Li [23], it is of great interest to develop procedures for estimating
this subspace, quite apart from any interest in the conditional distribution 
$P(Y \mid X)$ or the conditional mean $E(Y \mid X)$. Once the central subspace is
identified, subsequent analysis can attempt to infer a conditional distribution or
a regression function using the (low-dimensional) coordinates $\Pi_S X$.

The line of research on SDR initiated by Li is to be distinguished from 
the large and heterogeneous collection of methods for dimension reduction 
in regression in which specific modeling assumptions are imposed on the 
conditional distribution $P(Y \mid X)$ or the regression $E(Y \mid X)$. These methods
include ordinary least squares, partial least squares, canonical correlation 
analysis, ACE [4], projection pursuit regression [12], neural networks and 
LASSO [29]. These methods can be effective if the modeling assumptions 
that they embody are met, but if these assumptions do not hold there is no 
guarantee of finding the central subspace. 

Li's paper not only provided a formulation of SDR as a semiparamet- 
ric inference problem — with subsequent contributions by Cook and others 
bringing it to its elegant expression in terms of conditional independence — 
but also suggested a specific inferential methodology that has had significant 
influence on the ensuing literature. Specifically, Li suggested approaching the 
SDR problem as an inverse regression problem. Roughly speaking, the idea 
is that if the conditional distribution $P(Y \mid X)$ varies solely along a subspace
of the covariate space, then the inverse regression $E(X \mid Y)$ should lie in that
same subspace. Moreover, it should be easier to regress X on Y than vice
versa, given that Y is generally low-dimensional (indeed, one-dimensional in 
the majority of applications) while X is high-dimensional. Li [23] proposed a 
particularly simple instantiation of this idea, known as sliced inverse regression
(SIR), in which $E(X \mid Y)$ is estimated as a constant vector within each
slice of the response variable Y, and principal component analysis is used to 
aggregate these constant vectors into an estimate of the central subspace. 
The past decade has seen a number of further developments in this vein. 
Some focus on finding a central subspace, for example, [9, 10], while others 
aim at finding a central mean subspace, which is a subspace of the central 
subspace that is effective only for the regression $E[Y \mid X]$. The latter include
principal Hessian directions (pHd, [24]) and contour regression [22]. A par- 
ticular focus of these more recent developments has been the exploitation of 
second moments within an inverse regression framework. 
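
The SIR recipe described above (slice the response, average the covariate within each slice, then apply principal component analysis to the slice means) can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the implementation of [23]: the function name `sliced_inverse_regression`, the equal-count slicing rule, the whitening step and the toy model are all hypothetical choices.

```python
import numpy as np

def sliced_inverse_regression(X, y, n_slices=10, d=1):
    """Minimal SIR sketch: slice y, average whitened X per slice, then PCA."""
    n, m = X.shape
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # Whiten X with the inverse square root of the sample covariance.
    evals, evecs = np.linalg.eigh(cov)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - mu) @ inv_sqrt
    # Slice the response into groups of roughly equal size (an assumption).
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)
    # Weighted covariance of the slice means, an estimate of Cov(E[Z | Y]).
    M = np.zeros((m, m))
    for idx in slices:
        zbar = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(zbar, zbar)
    # Leading eigenvectors of M, mapped back to the original coordinates.
    w, v = np.linalg.eigh(M)
    top = v[:, np.argsort(w)[::-1][:d]]
    Q, _ = np.linalg.qr(inv_sqrt @ top)  # orthonormal basis of the estimate
    return Q

# Toy model (hypothetical): Y depends on X only through its first coordinate.
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 4))
y = X[:, 0] + 0.1 * rng.standard_normal(2000)
B_hat = sliced_inverse_regression(X, y, n_slices=10, d=1)
```

In this toy model the covariate is Gaussian, so the linearity condition discussed in the following paragraphs holds and the estimated direction should align closely with the first coordinate axis.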

While the inverse regression perspective has been quite useful, it is not 
without its drawbacks. In particular, performing a regression of X on Y
generally requires making assumptions with respect to the probability dis- 
tribution of X, assumptions that can be difficult to justify. In particular, 
most of the inverse regression methods make the assumption of linearity of 
the conditional mean of the covariate along the central subspace (or make 
a related assumption for the conditional covariance). These assumptions 
hold in particular if the distribution of X is elliptic. In practice, however,
we do not necessarily expect that the covariate vector will follow an el- 
liptic distribution, nor is it easy to assess departures from ellipticity in a 
high-dimensional setting. In general, it seems unfortunate to have to impose 
probabilistic assumptions on X in the setting of a regression methodology. 

Many inverse regression methods also exhibit additional limitations
depending on the specific nature of the response variable Y. In particular,
pHd and contour regression are applicable only to a one-dimensional
response. Also, if the response variable takes its values in a finite set of p
elements, SIR yields a subspace of dimension at most p - 1; thus, for the
important problem of binary classification SIR yields only a one-dimensional 
subspace. Finally, in the binary classification setting, if the covariance ma- 
trices of the two classes are the same, SAVE and pHd also provide only 
a one-dimensional subspace [7]. The general problem in these cases is that 
the estimated subspace is smaller than the central subspace. One approach 
to tackling these limitations is to incorporate higher-order moments of $Y \mid X$
[34], but in practice the gains achievable by the use of higher-order moments
are limited by robustness issues. 

In this paper, we present a new methodology for SDR that is rather differ- 
ent from the approaches considered in the literature discussed above. Rather 
than focusing on a limited set of moments within an inverse regression frame- 
work, we focus instead on the criterion of conditional independence in terms 
of which the SDR problem is defined. We develop a contrast function for 
evaluating subspaces that is minimized precisely when the conditional in- 
dependence assertion in (1) is realized. As befits a criterion that measures 
departure from conditional independence, our contrast function is not based 
solely on low-order moments. 

Our approach involves the use of conditional covariance operators on re- 
producing kernel Hilbert spaces (RKHSs). Our use of RKHSs is related to 
their use in nonparametric regression and classification; in particular, the 
RKHSs given by some positive definite kernels are Hilbert spaces of smooth 
functions that are "small" enough to yield computationally-tractable proce- 
dures, but are rich enough to capture nonparametric phenomena of interest 
[32], and this computational focus is an important aspect of our work. On 
the other hand, whereas in nonparametric regression and classification the 
role of RKHSs is to provide basis expansions of regression functions and 
discriminant functions, in our case the RKHS plays a different role. Our in- 
terest is not in the functions in the RKHS per se, but rather in conditional 
covariance operators defined on the RKHS. We show that these operators 
can be used to measure departures from conditional independence. We also
show that these operators can be estimated from data and that these
estimates are functions of Gram matrices. Thus, our approach, which we refer
to as kernel dimension reduction (KDR), involves computing Gram matrices
from data and optimizing a particular functional of these Gram matrices
to yield an estimate of the central subspace.

This approach makes no strong assumptions on either the conditional
distribution $p_{Y|\Pi_S X}(y \mid \Pi_S x)$ or the marginal distribution $p_X(x)$. As we show,
KDR is consistent as an estimator of the central subspace under weak
conditions.

There are alternatives to the inverse regression approach in the litera- 
ture that have some similarities to KDR. In particular, minimum average 
variance estimation (MAVE, [33]) is based on nonparametric estimation of 
the conditional covariance of Y given X, an idea related to KDR. This 
method explicitly estimates the regressor, however, assuming an additive 
noise model $Y = f(X) + Z$, where Z is independent of X. While the purpose
of MAVE is to find a central mean subspace, KDR tries to find a central sub- 
space, and does not need to estimate the regressor explicitly. Other related 
approaches include methods that estimate the derivative of the regression 
function; these are based on the fact that the derivative of the conditional 
expectation $g(x) = E[Y \mid B^T x]$ with respect to x belongs to a dimension
reduction subspace [18, 27]. The purpose of these methods is again to extract
a central mean subspace; this differs from the central subspace which is the
focus of KDR. The difference is clear, for example, if we consider the
situation in which a direction b in a central subspace satisfies $E[g'(b^T X)] = 0$,
a condition that occurs if g and the distribution of X exhibit certain
symmetries. Such a direction cannot be found by methods based on the derivative.
There has also been some recent work on nonparametric methods for
estimation of central subspaces. One such method estimates the central
subspace based on an expected log likelihood [35]. This requires, however, an
estimate of the joint probability density, and is limited to single-index
regression. Finally, Zhu and Zeng [36] have proposed a method for estimating
the central subspace based on the Fourier transform. This method is simi- 
lar to the KDR method in its use of Hilbert space methods and in its use 
of a contrast function that can characterize conditional independence un- 
der weak assumptions. It differs from KDR, however, in that it requires 
an estimate of the derivative of the marginal density of the covariate X; 
in practice this requires assuming a parametric model for the covariate X. 
In general, we are aware of no practical method that attacks SDR directly 
by using nonparametric methodology to assess departures from conditional 
independence. 

We presented an earlier kernel dimension reduction method in [14]. The 
contrast function presented in that paper, however, was not derived as an 
estimator of a conditional covariance operator, and it was not possible to es- 
tablish a consistency result for that approach. The contrast function that we 
present here is derived directly from the conditional covariance perspective; 
moreover, it is simpler than the earlier estimator and it is possible to establish
consistency for the new formulation. We should note, however, that the
empirical performance of the earlier KDR method was shown by Fukumizu, 
Bach and Jordan [14] to yield a significant improvement on SIR and pHd 
in the case of nonelliptic data, and these empirical results motivated us to 
pursue the general approach further. 

While KDR has advantages over other SDR methods because of its gen- 
erality and its directness in capturing the semiparametric nature of the SDR 
problem, it also reposes on a more complex mathematical framework that 
presents new theoretical challenges. Thus, while consistency for SIR and 
related methods follows from a straightforward appeal to the central limit 
theorem (under ellipticity assumptions), more effort is required to study 
the statistical behavior of KDR theoretically. This effort is of some general 
value, however; in particular, to establish the consistency of KDR we prove 
the uniform $O(n^{-1/2})$ convergence of an empirical process that takes values
in a reproducing kernel Hilbert space. This result, which accords with the
order of uniform convergence of an ordinary real-valued empirical process,
may be of independent theoretical interest. 

It should be noted at the outset that we do not attempt to provide dis- 
tribution theory for KDR in this paper, and in particular we do not address 
the problem of inferring the dimensionality of the central subspace. 

The paper is organized as follows. In Section 2 we show how conditional in- 
dependence can be characterized by cross-covariance operators on an RKHS 
and use this characterization to derive the KDR method. Section 3 presents 
numerical examples of the KDR method. We present a consistency theorem 
and its proof in Section 4. Section 5 provides concluding remarks. Some of 
the details in the proof of consistency are provided in the Appendix. 

2. Kernel dimension reduction for regression. The method of kernel di- 
mension reduction is based on a characterization of conditional independence 
using operators on RKHSs. We present this characterization in Section 2.1 
and show how it yields a population criterion for SDR in Section 2.2. This 
population criterion is then turned into a finite-sample estimation procedure 
in Section 2.3. 

In this paper, a Hilbert space means a separable Hilbert space, and an 
operator always means a linear operator. The operator norm of a bounded 
operator T is denoted by ||T||. The null space and the range of an operator 
T are denoted by $\mathcal{N}(T)$ and $\mathcal{R}(T)$, respectively.

2.1. Characterization of conditional independence. Let $(\mathcal{X}, \mathcal{B}_{\mathcal{X}})$ and
$(\mathcal{Y}, \mathcal{B}_{\mathcal{Y}})$ denote measurable spaces. When the base space is a topological space,
the Borel $\sigma$-field is always assumed. Let $(\mathcal{H}_X, k_X)$ and $(\mathcal{H}_Y, k_Y)$ be RKHSs of
functions on $\mathcal{X}$ and $\mathcal{Y}$, respectively, with measurable positive definite kernels
$k_X$ and $k_Y$ [1]. We consider a random vector $(X, Y) : \Omega \to \mathcal{X} \times \mathcal{Y}$ with the
law $P_{XY}$. The marginal distributions of X and Y are denoted by $P_X$ and $P_Y$,
respectively. It is always assumed that the positive definite kernels satisfy

(2) $E_X[k_X(X, X)] < \infty$ and $E_Y[k_Y(Y, Y)] < \infty$.

Note that any bounded kernels satisfy this assumption. Also, under this
assumption, $\mathcal{H}_X$ and $\mathcal{H}_Y$ are included in $L^2(P_X)$ and $L^2(P_Y)$, respectively,
where $L^2(\mu)$ denotes the Hilbert space of square integrable functions with
respect to the measure $\mu$, and the inclusions $J_X : \mathcal{H}_X \to L^2(P_X)$ and
$J_Y : \mathcal{H}_Y \to L^2(P_Y)$ are continuous, because

$E_X[f(X)^2] = E_X[\langle f, k_X(\cdot, X) \rangle_{\mathcal{H}_X}^2] \le \|f\|_{\mathcal{H}_X}^2 E_X[k_X(X, X)]$ for $f \in \mathcal{H}_X$.

The cross-covariance operator of (X, Y) is an operator $\Sigma_{YX}$ from $\mathcal{H}_X$ to $\mathcal{H}_Y$
so that

(3) $\langle g, \Sigma_{YX} f \rangle_{\mathcal{H}_Y} = E_{XY}[(f(X) - E_X[f(X)])(g(Y) - E_Y[g(Y)])]$

holds for all $f \in \mathcal{H}_X$ and $g \in \mathcal{H}_Y$ [3, 14]. Obviously, $\Sigma_{YX} = \Sigma_{XY}^*$, where
$T^*$ denotes the adjoint of an operator T. If Y is equal to X, the positive
self-adjoint operator $\Sigma_{XX}$ is called the covariance operator.

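For a random variable $X : \Omega \to \mathcal{X}$, the mean element $m_X \in \mathcal{H}_X$ is defined
by the element that satisfies

(4) $\langle f, m_X \rangle_{\mathcal{H}_X} = E_X[f(X)]$

for all $f \in \mathcal{H}_X$; that is, $m_X = J_X^* 1$, where 1 is the constant function. The
explicit function form of $m_X$ is given by $m_X(u) = \langle m_X, k_X(\cdot, u) \rangle_{\mathcal{H}_X} = E[k_X(X, u)]$.
Using the mean elements, (3), which characterizes $\Sigma_{YX}$, can be written as

$\langle g, \Sigma_{YX} f \rangle_{\mathcal{H}_Y} = E_{XY}[\langle f, k_X(\cdot, X) - m_X \rangle_{\mathcal{H}_X} \langle k_Y(\cdot, Y) - m_Y, g \rangle_{\mathcal{H}_Y}]$.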
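
The defining property (3) and its rewriting in terms of mean elements are easy to check numerically when the kernels have explicit finite-dimensional feature maps, since the cross-covariance operator is then an ordinary matrix. The following sketch assumes hypothetical polynomial feature maps `phi` and `psi` (our choice for illustration, not kernels used in this paper) and works with the empirical distribution of a toy sample.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.standard_normal(n)
y = np.sin(x) + 0.3 * rng.standard_normal(n)

# Hypothetical feature maps; each kernel is the inner product of its features.
phi = np.column_stack([x, x**2])        # features of X, shape (n, 2)
psi = np.column_stack([y, y**2, y**3])  # features of Y, shape (n, 3)

# Empirical mean elements and cross-covariance operator (a 3 x 2 matrix here).
m_x = phi.mean(axis=0)
m_y = psi.mean(axis=0)
sigma_yx = (psi - m_y).T @ (phi - m_x) / n

# Functions f = <a, phi(.)> and g = <b, psi(.)> in the two RKHSs.
a = np.array([1.0, -0.5])
b = np.array([0.2, 0.0, 1.0])
f_vals = phi @ a
g_vals = psi @ b

# Left-hand side of (3): <g, Sigma_YX f>.
lhs = b @ sigma_yx @ a
# Right-hand side of (3): empirical covariance of f(X) and g(Y).
rhs = np.mean((f_vals - f_vals.mean()) * (g_vals - g_vals.mean()))
```

The two quantities `lhs` and `rhs` agree up to floating-point error, which is the finite-sample version of (3).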

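Let $Q_X$ and $Q_Y$ be the orthogonal projections which map $\mathcal{H}_X$ onto
$\overline{\mathcal{R}(\Sigma_{XX})}$ and $\mathcal{H}_Y$ onto $\overline{\mathcal{R}(\Sigma_{YY})}$, respectively. It is known [3], Theorem 1,
that $\Sigma_{YX}$ has a representation of the form

(5) $\Sigma_{YX} = \Sigma_{YY}^{1/2} V_{YX} \Sigma_{XX}^{1/2}$,

where $V_{YX} : \mathcal{H}_X \to \mathcal{H}_Y$ is a unique bounded operator such that $\|V_{YX}\| \le 1$
and $V_{YX} = Q_Y V_{YX} Q_X$.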

A cross-covariance operator on an RKHS can be represented explicitly as
an integral operator. For arbitrary $\varphi \in L^2(P_X)$ and $y \in \mathcal{Y}$, the integral

(6) $G_\varphi(y) = \int_{\mathcal{X} \times \mathcal{Y}} k_Y(y, \tilde{y}) (\varphi(\tilde{x}) - E_X[\varphi(X)]) \, dP_{XY}(\tilde{x}, \tilde{y})$

always exists and is an element of $L^2(P_Y)$. It is not difficult to see that

$S_{YX} : L^2(P_X) \to L^2(P_Y), \quad \varphi \mapsto G_\varphi$

is a bounded linear operator with $\|S_{YX}\| \le E_Y[k_Y(Y, Y)]$. If f is a function
in $\mathcal{H}_X$, we have for any $y \in \mathcal{Y}$

$G_f(y) = \langle k_Y(\cdot, y), \Sigma_{YX} f \rangle_{\mathcal{H}_Y} = (\Sigma_{YX} f)(y)$,

which implies the following proposition:

Proposition 1. The covariance operator $\Sigma_{YX} : \mathcal{H}_X \to \mathcal{H}_Y$ is the
restriction of the integral operator $S_{YX}$ to $\mathcal{H}_X$. More precisely,

$J_Y \Sigma_{YX} = S_{YX} J_X$.

Conditional variance can also be represented by covariance operators.
Define the conditional covariance operator $\Sigma_{YY|X}$ by

$\Sigma_{YY|X} = \Sigma_{YY} - \Sigma_{YY}^{1/2} V_{YX} V_{XY} \Sigma_{YY}^{1/2}$,

where $V_{YX}$ is the bounded operator in (5). For convenience we sometimes
write $\Sigma_{YY|X}$ as

$\Sigma_{YY|X} = \Sigma_{YY} - \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY}$,

which is an abuse of notation, because $\Sigma_{XX}^{-1}$ may not exist.

The following two propositions provide insights into the meaning of a 
conditional covariance operator. The former proposition relates the operator 
to the residual error of regression, and the latter proposition expresses the 
residual error in terms of the conditional variance. 



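Proposition 2. For any $g \in \mathcal{H}_Y$,

$\langle g, \Sigma_{YY|X} g \rangle_{\mathcal{H}_Y} = \inf_{f \in \mathcal{H}_X} E_{XY} |(g(Y) - E_Y[g(Y)]) - (f(X) - E_X[f(X)])|^2$.

Proof. Let $\Sigma_{YX} = \Sigma_{YY}^{1/2} V_{YX} \Sigma_{XX}^{1/2}$ be the decomposition in (5), and define
$\mathcal{E}_g(f) = E_{YX} |(g(Y) - E_Y[g(Y)]) - (f(X) - E_X[f(X)])|^2$. From the equality

$\mathcal{E}_g(f) = \|\Sigma_{XX}^{1/2} f\|_{\mathcal{H}_X}^2 - 2 \langle V_{XY} \Sigma_{YY}^{1/2} g, \Sigma_{XX}^{1/2} f \rangle_{\mathcal{H}_X} + \|\Sigma_{YY}^{1/2} g\|_{\mathcal{H}_Y}^2$,

replacing $\Sigma_{XX}^{1/2} f$ with an arbitrary $\phi \in \mathcal{H}_X$ yields

$\inf_{f \in \mathcal{H}_X} \mathcal{E}_g(f) \ge \inf_{\phi \in \mathcal{H}_X} \{ \|\phi\|_{\mathcal{H}_X}^2 - 2 \langle V_{XY} \Sigma_{YY}^{1/2} g, \phi \rangle_{\mathcal{H}_X} + \|\Sigma_{YY}^{1/2} g\|_{\mathcal{H}_Y}^2 \}$
$= \inf_{\phi \in \mathcal{H}_X} \|\phi - V_{XY} \Sigma_{YY}^{1/2} g\|_{\mathcal{H}_X}^2 + \langle g, \Sigma_{YY|X} g \rangle_{\mathcal{H}_Y}$
$= \langle g, \Sigma_{YY|X} g \rangle_{\mathcal{H}_Y}$.

For the opposite inequality, take an arbitrary $\varepsilon > 0$. From the fact that
$V_{XY} \Sigma_{YY}^{1/2} g \in \overline{\mathcal{R}(\Sigma_{XX})} = \overline{\mathcal{R}(\Sigma_{XX}^{1/2})}$, there exists $f_* \in \mathcal{H}_X$ such that
$\|\Sigma_{XX}^{1/2} f_* - V_{XY} \Sigma_{YY}^{1/2} g\|_{\mathcal{H}_X} \le \varepsilon$. For such $f_*$, completing the square in the
equality above gives

$\mathcal{E}_g(f_*) = \|\Sigma_{XX}^{1/2} f_* - V_{XY} \Sigma_{YY}^{1/2} g\|_{\mathcal{H}_X}^2 + \|\Sigma_{YY}^{1/2} g\|_{\mathcal{H}_Y}^2 - \|V_{XY} \Sigma_{YY}^{1/2} g\|_{\mathcal{H}_X}^2 \le \varepsilon^2 + \langle g, \Sigma_{YY|X} g \rangle_{\mathcal{H}_Y}$.

Because $\varepsilon$ is arbitrary, we have $\inf_{f \in \mathcal{H}_X} \mathcal{E}_g(f) \le \langle g, \Sigma_{YY|X} g \rangle_{\mathcal{H}_Y}$. □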

Proposition 2 is an analog for operators of a well-known result on
covariance matrices and linear regression: the conditional covariance matrix
$C_{YY|X} = C_{YY} - C_{YX} C_{XX}^{-1} C_{XY}$ expresses the residual error of the least
square regression problem as $b^T C_{YY|X} b = \min_a E|b^T Y - a^T X|^2$.
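
The matrix identity just quoted can be verified directly on a sample. The sketch below is only an illustration with hypothetical toy data: both variables are centered so that no intercept is needed, and the sample covariances use the 1/n normalization so that the identity holds exactly rather than approximately.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
X = rng.standard_normal((n, 3))
Y = X @ rng.standard_normal((3, 2)) + 0.5 * rng.standard_normal((n, 2))

# Center both variables so that the regression needs no intercept.
Xc = X - X.mean(axis=0)
Yc = Y - Y.mean(axis=0)

# Sample covariance blocks (1/n normalization throughout).
C_xx = Xc.T @ Xc / n
C_xy = Xc.T @ Yc / n
C_yy = Yc.T @ Yc / n

# Conditional covariance matrix C_{YY|X} = C_YY - C_YX C_XX^{-1} C_XY.
C_yy_given_x = C_yy - C_xy.T @ np.linalg.solve(C_xx, C_xy)

# Residual error of the least squares regression of b^T Y on X.
b = np.array([1.0, -2.0])
a, *_ = np.linalg.lstsq(Xc, Yc @ b, rcond=None)
residual = np.mean((Yc @ b - Xc @ a) ** 2)
quadratic_form = b @ C_yy_given_x @ b
```

Here `residual` and `quadratic_form` coincide up to rounding; Proposition 2 states the same relation with the matrices replaced by operators on RKHSs.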

To relate the residual error in Proposition 2 to the conditional variance
of g(Y) given X, we make the following mild assumption:

(AS) $\mathcal{H}_X + \mathbb{R}$ is dense in $L^2(P_X)$, where $\mathcal{H}_X + \mathbb{R}$ denotes the direct sum
of the RKHS $\mathcal{H}_X$ and the RKHS $\mathbb{R}$ [1].

As seen later in Section 2.2, there are many positive definite kernels that
satisfy the assumption (AS). Examples include the Gaussian radial basis
function (RBF) kernel $k(x, y) = \exp(-\|x - y\|^2/\sigma^2)$ on $\mathbb{R}^m$ or on a compact
subset of $\mathbb{R}^m$.

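Proposition 3. Under the assumption (AS),

(7) $\langle g, \Sigma_{YY|X} g \rangle_{\mathcal{H}_Y} = E_X[\mathrm{Var}_{Y|X}[g(Y) \mid X]]$

for all $g \in \mathcal{H}_Y$.

Proof. From Proposition 2, we have

$\langle g, \Sigma_{YY|X} g \rangle_{\mathcal{H}_Y}$
$= \inf_{f \in \mathcal{H}_X} \mathrm{Var}[g(Y) - f(X)]$
$= \inf_{f \in \mathcal{H}_X} \{ \mathrm{Var}_X[E_{Y|X}[g(Y) - f(X) \mid X]] + E_X[\mathrm{Var}_{Y|X}[g(Y) - f(X) \mid X]] \}$
$= \inf_{f \in \mathcal{H}_X} \mathrm{Var}_X[E_{Y|X}[g(Y) \mid X] - f(X)] + E_X[\mathrm{Var}_{Y|X}[g(Y) \mid X]]$.

Let $\varphi(x) = E_{Y|X}[g(Y) \mid X = x]$. Since $\varphi \in L^2(P_X)$ from $\mathrm{Var}[\varphi(X)] \le \mathrm{Var}[g(Y)] < \infty$,
the assumption (AS) implies that for an arbitrary $\varepsilon > 0$ there exists
$f \in \mathcal{H}_X$ and $c \in \mathbb{R}$ such that $h = f + c$ satisfies $\|\varphi - h\|_{L^2(P_X)} < \varepsilon$.
Because $\mathrm{Var}[\varphi(X) - f(X)] \le \|\varphi - h\|_{L^2(P_X)}^2 < \varepsilon^2$ and $\varepsilon$ is arbitrary, we have
$\inf_{f \in \mathcal{H}_X} \mathrm{Var}_X[E_{Y|X}[g(Y) \mid X] - f(X)] = 0$, which completes the proof. □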

Proposition 3 improves a result due to Fukumizu, Bach and Jordan [14],
Proposition 5, where the much stronger assumption $E[g(Y) \mid X = \cdot] \in \mathcal{H}_X$
was imposed.

Propositions 2 and 3 imply that the operator $\Sigma_{YY|X}$ can be interpreted
as capturing the predictive ability for Y of the explanatory variable X.

2.2. Criterion of kernel dimension reduction. Let $M(m \times n; \mathbb{R})$ be the
set of real-valued $m \times n$ matrices. For a natural number $d \le m$, the Stiefel
manifold $\mathbb{S}^m_d(\mathbb{R})$ is defined by

$\mathbb{S}^m_d(\mathbb{R}) = \{B \in M(m \times d; \mathbb{R}) \mid B^T B = I_d\}$,

which is the set of all d orthonormal vectors in $\mathbb{R}^m$. It is well known that
$\mathbb{S}^m_d(\mathbb{R})$ is a compact smooth manifold. For $B \in \mathbb{S}^m_d(\mathbb{R})$, the matrix $BB^T$
defines an orthogonal projection of $\mathbb{R}^m$ onto the d-dimensional subspace
spanned by the column vectors of B. Although the Grassmann manifold
is often used in the study of sets of subspaces in $\mathbb{R}^m$, we find the Stiefel
manifold more convenient as it allows us to use matrix notation explicitly.
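
As a small numerical illustration of these definitions, the following sketch draws a point of the Stiefel manifold and checks the properties of the associated projection. The helper name `random_stiefel` and the use of a QR factorization of a Gaussian matrix are our own assumptions for the example.

```python
import numpy as np

def random_stiefel(m, d, rng):
    """Draw an m x d matrix with orthonormal columns via a QR factorization."""
    A = rng.standard_normal((m, d))
    Q, _ = np.linalg.qr(A)
    return Q

rng = np.random.default_rng(0)
B = random_stiefel(4, 2, rng)
P = B @ B.T  # orthogonal projection of R^4 onto the column span of B

# Another orthonormal basis B R of the same subspace gives the same projection.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
P_rotated = (B @ R) @ (B @ R).T
```

The last lines show why distinct points of the Stiefel manifold can represent the same subspace: the projection $BB^T$ is unchanged when B is multiplied on the right by an orthogonal matrix.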

Hereafter, $\mathcal{X}$ is assumed to be either a closed ball $D_m(r) = \{x \in \mathbb{R}^m \mid \|x\| \le r\}$
or the entire Euclidean space $\mathbb{R}^m$; both assumptions satisfy the condition
that the projection $BB^T \mathcal{X}$ is included in $\mathcal{X}$ for all $B \in \mathbb{S}^m_d(\mathbb{R})$.

Let $\mathbb{B}^m_d \subseteq \mathbb{S}^m_d(\mathbb{R})$ denote the subset of matrices whose columns span a
dimension-reduction subspace; for each $B_0 \in \mathbb{B}^m_d$, we have

(8) $p_{Y|X}(y \mid x) = p_{Y|B_0^T X}(y \mid B_0^T x)$,

where $p_{Y|X}(y \mid x)$ and $p_{Y|B^T X}(y \mid u)$ are the conditional probability densities
of Y given X, and Y given $B^T X$, respectively. The existence and positivity
of these conditional probability densities are always assumed hereafter. As
we have discussed in the Introduction, under conditions given by [6], Section
6.4, this subset represents the central subspace (under the assumption that
d is the minimum dimensionality of the dimension reduction subspaces).

We now turn to the key problem of characterizing the subset $\mathbb{B}^m_d$ using
conditional covariance operators on reproducing kernel Hilbert spaces. In the
following, we assume that $k_d(z, \tilde{z})$ is a positive definite kernel on $\mathcal{Z} = D_d(r)$
or $\mathbb{R}^d$ such that $E_X[k_d(B^T X, B^T X)] < \infty$ for all $B \in \mathbb{S}^m_d(\mathbb{R})$, and we let $k_X^B$
denote a positive definite kernel on $\mathcal{X}$ given by

(9) $k_X^B(x, \tilde{x}) = k_d(B^T x, B^T \tilde{x})$

for each $B \in \mathbb{S}^m_d(\mathbb{R})$. The RKHS associated with $k_X^B$ is denoted by $\mathcal{H}_X^B$.
Note that $\mathcal{H}_X^B = \{f : \mathcal{X} \to \mathbb{R} \mid \text{there exists } g \in \mathcal{H}_{k_d} \text{ such that } f(x) = g(B^T x)\}$,
where $\mathcal{H}_{k_d}$ is the RKHS given by $k_d$. As seen later in Theorem 4, if $\mathcal{X}$ and
$\mathcal{Y}$ are subsets of Euclidean spaces and Gaussian RBF kernels are used for
$k_X$ and $k_Y$, under some conditions the subset $\mathbb{B}^m_d$ is characterized by the set
of solutions of an optimization problem

(10) $\mathbb{B}^m_d = \arg\min_{B \in \mathbb{S}^m_d(\mathbb{R})} \Sigma_{YY|X}^B$,

where $\Sigma_{YX}^B$ and $\Sigma_{XX}^B$ denote the (cross-) covariance operators with respect
to the kernel $k_X^B$, and

$\Sigma_{YY|X}^B = \Sigma_{YY} - \Sigma_{YX}^B (\Sigma_{XX}^B)^{-1} \Sigma_{XY}^B$.

The minimization in (10) refers to the minimal operators in the partial order
of self-adjoint operators.

We use the trace to evaluate the partial order of self-adjoint operators.
While other possibilities exist (e.g., the determinant), the trace has the
advantage of yielding a relatively simple theoretical analysis, which is
conducted in Section 4. The operator $\Sigma_{YY|X}^B$ is trace class for all $B \in \mathbb{S}^m_d(\mathbb{R})$,
since $\Sigma_{YY|X}^B \le \Sigma_{YY}$ and $\mathrm{Tr}[\Sigma_{YY}] < \infty$, which is shown in Section 4.2.
Henceforth the minimization in (10) should thus be understood as that of
minimizing $\mathrm{Tr}[\Sigma_{YY|X}^B]$.

From Propositions 2 and 3, minimization of $\mathrm{Tr}[\Sigma_{YY|X}^B]$ is equivalent to the
minimization of the sum of the residual errors for the optimal prediction of
functions of Y using $B^T X$, where the sum is taken over a complete orthonormal
system $\{\xi_a\}_{a=1}^\infty$ of $\mathcal{H}_Y$. Thus, the objective of dimension reduction is
rewritten as

(11) $\min_{B \in \mathbb{S}^m_d(\mathbb{R})} \sum_{a=1}^\infty \min_{f \in \mathcal{H}_X^B} E|(\xi_a(Y) - E[\xi_a(Y)]) - (f(X) - E[f(X)])|^2$.

This is intuitively reasonable as a criterion of choosing B, and we will see
that this is equivalent to finding the central subspace under some conditions.

We now introduce a class of kernels to characterize conditional independence.
Let $(\Omega, \mathcal{B})$ be a measurable space, let $(\mathcal{H}, k)$ be an RKHS over $\Omega$ with
the kernel k measurable and bounded, and let $\mathcal{S}$ be the set of all probability
measures on $(\Omega, \mathcal{B})$. The RKHS $\mathcal{H}$ is called characteristic (with respect to
$\mathcal{B}$) if the map

(12) $\mathcal{S} \ni P \mapsto m_P = E_{X \sim P}[k(\cdot, X)] \in \mathcal{H}$

is one-to-one, where $m_P$ is the mean element of the random variable with
law P. It is easy to see that $\mathcal{H}$ is characteristic if and only if the equality
$\int f \, dP = \int f \, dQ$ for all $f \in \mathcal{H}$ means $P = Q$. We also call a positive definite
kernel k characteristic if the associated RKHS is characteristic.

It is known that the Gaussian RBF kernel $\exp(-\|x - y\|^2/\sigma^2)$ and the
so-called Laplacian kernel $\exp(-\alpha \sum_{i=1}^m |x_i - y_i|)$ ($\alpha > 0$) are characteristic on
$\mathbb{R}^m$ or on a compact subset of $\mathbb{R}^m$ with respect to the Borel $\sigma$-field [2, 15, 28].
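
To make the notion of a characteristic kernel concrete, the sketch below compares the empirical mean elements of two samples through their squared RKHS distance. The function names (`gaussian_gram`, `laplacian_gram`, `mean_element_distance2`), the kernel parameters and the toy distributions are all hypothetical choices made for this illustration. The two distributions have the same mean and differ only in variance, so a linear kernel could not separate them, whereas the mean elements under the Gaussian and Laplacian kernels differ.

```python
import numpy as np

def gaussian_gram(X, Z, sigma2):
    """Matrix of exp(-||x_i - z_j||^2 / sigma^2)."""
    sq = (np.sum(X**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :]
          - 2 * X @ Z.T)
    return np.exp(-np.maximum(sq, 0.0) / sigma2)

def laplacian_gram(X, Z, alpha):
    """Matrix of exp(-alpha * sum_k |x_ik - z_jk|)."""
    l1 = np.abs(X[:, None, :] - Z[None, :, :]).sum(axis=2)
    return np.exp(-alpha * l1)

def mean_element_distance2(X, Z, gram):
    """Squared RKHS distance between the empirical mean elements of X and Z."""
    return gram(X, X).mean() + gram(Z, Z).mean() - 2 * gram(X, Z).mean()

rng = np.random.default_rng(0)
n = 500
X1 = rng.standard_normal((n, 1))      # sample from N(0, 1)
X2 = rng.standard_normal((n, 1))      # second sample from N(0, 1)
Z = 3.0 * rng.standard_normal((n, 1)) # sample from N(0, 9), same mean

gauss = lambda A, B: gaussian_gram(A, B, sigma2=1.0)
lap = lambda A, B: laplacian_gram(A, B, alpha=1.0)

same_gauss = mean_element_distance2(X1, X2, gauss)
diff_gauss = mean_element_distance2(X1, Z, gauss)
same_lap = mean_element_distance2(X1, X2, lap)
diff_lap = mean_element_distance2(X1, Z, lap)
```

For both kernels the distance between samples from different laws is markedly larger than the distance between two samples from the same law, which is nonzero only because of sampling error.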

The following theorem improves Theorem 7 in [14], and is the theoretical
basis of kernel dimension reduction. In the following, let $P_B$ denote the
probability on $\mathcal{X}$ induced from $P_X$ by the projection $BB^T : \mathcal{X} \to \mathcal{X}$.

Theorem 4. Suppose that the closure of $\mathcal{H}_X^B$ in $L^2(P_X)$ is included
in the closure of $\mathcal{H}_X$ in $L^2(P_X)$ for any $B \in \mathbb{S}^m_d(\mathbb{R})$. Then,

(13) $\Sigma_{YY|X}^B \ge \Sigma_{YY|X}$,

where the inequality refers to the order of self-adjoint operators. If further
$(\mathcal{H}_X, P_X)$ and $(\mathcal{H}_X^B, P_B)$ satisfy (AS) for every $B \in \mathbb{S}^m_d(\mathbb{R})$ and $\mathcal{H}_Y$ is
characteristic, the following equivalence holds

(14) $\Sigma_{YY|X} = \Sigma_{YY|X}^B \iff Y \perp\!\!\!\perp X \mid B^T X$.

Proof. The first assertion is obvious from Proposition 2. For the second
assertion, let C be an $m \times (m - d)$ matrix whose columns span the orthogonal
complement to the subspace spanned by the columns of B, and let $(U, V) =
(B^T X, C^T X)$ for notational simplicity. By taking the expectation of the
well-known relation

$\mathrm{Var}_{Y|U}[g(Y) \mid U] = E_{V|U}[\mathrm{Var}_{Y|U,V}[g(Y) \mid U, V]] + \mathrm{Var}_{V|U}[E_{Y|U,V}[g(Y) \mid U, V]]$

with respect to U, we have

$E_U[\mathrm{Var}_{Y|U}[g(Y) \mid U]] = E_X[\mathrm{Var}_{Y|X}[g(Y) \mid X]] + E_U[\mathrm{Var}_{V|U}[E_{Y|U,V}[g(Y) \mid U, V]]]$,

from which Proposition 3 yields

$\langle g, (\Sigma_{YY|X}^B - \Sigma_{YY|X}) g \rangle_{\mathcal{H}_Y} = E_U[\mathrm{Var}_{V|U}[E_{Y|U,V}[g(Y) \mid U, V]]]$.

It follows that the operator equality in (14) holds if and only if, for every
$g \in \mathcal{H}_Y$, $E_{Y|U,V}[g(Y) \mid U, V]$ does not depend on V almost surely. This is
equivalent to

$E_{Y|X}[g(Y) \mid X] = E_{Y|U}[g(Y) \mid U]$

almost surely. Since $\mathcal{H}_Y$ is characteristic, this means that the conditional
probability of Y given X is reduced to that of Y given U. □

The assumption (AS) and the notion of characteristic kernel are closely
related. In fact, from the following proposition, (AS) is satisfied if a
characteristic kernel is used. Thus, if $\mathcal{Y}$ is Euclidean, the choice of Gaussian RBF
kernels for $k_d$, $k_X$ and $k_Y$ is sufficient to guarantee the equivalence given by
(14).

Proposition 5. Let $(\Omega, \mathcal{B})$ be a measurable space, and $(k, \mathcal{H})$ be a bounded
measurable positive definite kernel on $\Omega$ and its RKHS. Then, k is
characteristic if and only if $\mathcal{H} + \mathbb{R}$ is dense in $L^2(P)$ for any probability measure
P on $(\Omega, \mathcal{B})$.

Proof. For the proof of the "if" part, suppose $m_P = m_Q$ for $P \ne Q$. Denote
the total variation of $P - Q$ by $|P - Q|$. Since $\mathcal{H} + \mathbb{R}$ is dense in
$L^2(|P - Q|)$, for arbitrary $\varepsilon > 0$ and $A \in \mathcal{B}$, there exists $f \in \mathcal{H} + \mathbb{R}$ such
that $\int |f - I_A| \, d|P - Q| < \varepsilon$, where $I_A$ is the indicator function of A. It follows
that $|(E_P[f(X)] - P(A)) - (E_Q[f(X)] - Q(A))| < \varepsilon$. Because $E_P[f(X)] =
E_Q[f(X)]$ from $m_P = m_Q$, we have $|P(A) - Q(A)| < \varepsilon$ for any $\varepsilon > 0$, which
contradicts $P \ne Q$.

For the opposite direction, suppose $\mathcal{H} + \mathbb{R}$ is not dense in $L^2(P)$. There is a
nonzero $f \in L^2(P)$ such that $\int f \, dP = 0$ and $\int f \varphi \, dP = 0$ for any $\varphi \in \mathcal{H}$. Let
$c = 1/\|f\|_{L^1(P)}$, and define two probability measures $Q_1$ and $Q_2$ by $Q_1(E) =
c \int_E |f| \, dP$ and $Q_2(E) = c \int_E (|f| - f) \, dP$ for any measurable set E. By $f \ne 0$,
we have $Q_1 \ne Q_2$, while $E_{Q_1}[k(\cdot, X)] - E_{Q_2}[k(\cdot, X)] = c \int f(x) k(\cdot, x) \, dP(x) =
0$, which means k is not characteristic. □

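2.3. Kernel dimension reduction procedure. We now use the characterization
given in Theorem 4 to develop an optimization procedure for estimating
the central subspace from an empirical sample $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$.
We assume that $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ is sampled i.i.d. from $P_{XY}$ and we
assume that there exists $B_0 \in \mathbb{S}^m_d(\mathbb{R})$ such that $p_{Y|X}(y \mid x) = p_{Y|B_0^T X}(y \mid B_0^T x)$.

We define the empirical cross-covariance operator $\hat{\Sigma}_{YX}^{(n)}$ by evaluating the
cross-covariance operator at the empirical distribution $\frac{1}{n} \sum_{i=1}^n \delta_{X_i} \delta_{Y_i}$. When
acting on functions $f \in \mathcal{H}_X$ and $g \in \mathcal{H}_Y$, the operator $\hat{\Sigma}_{YX}^{(n)}$ gives the
empirical covariance

$\langle g, \hat{\Sigma}_{YX}^{(n)} f \rangle_{\mathcal{H}_Y} = \frac{1}{n} \sum_{i=1}^n g(Y_i) f(X_i) - \left( \frac{1}{n} \sum_{i=1}^n g(Y_i) \right) \left( \frac{1}{n} \sum_{i=1}^n f(X_i) \right)$.

Also, for $B \in \mathbb{S}^m_d(\mathbb{R})$, let $\hat{\Sigma}_{YY|X}^{B(n)}$ denote the empirical conditional covariance
operator:

(15) $\hat{\Sigma}_{YY|X}^{B(n)} = \hat{\Sigma}_{YY}^{(n)} - \hat{\Sigma}_{YX}^{B(n)} (\hat{\Sigma}_{XX}^{B(n)} + \varepsilon_n I)^{-1} \hat{\Sigma}_{XY}^{B(n)}$.

The regularization term $\varepsilon_n I$ ($\varepsilon_n > 0$) is required to enable operator inversion
and is thus analogous to Tikhonov regularization [17]. We will see that the
regularization term is also needed for consistency.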
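We now define the KDR estimator as any minimizer of $\mathrm{Tr}[\hat{\Sigma}_{YY|X}^{B(n)}]$
on the manifold $\mathbb{S}^m_d(\mathbb{R})$; that is, any matrix in $\mathbb{S}^m_d(\mathbb{R})$ that minimizes

(16) $\mathrm{Tr}[\hat{\Sigma}_{YY}^{(n)} - \hat{\Sigma}_{YX}^{B(n)} (\hat{\Sigma}_{XX}^{B(n)} + \varepsilon_n I)^{-1} \hat{\Sigma}_{XY}^{B(n)}]$.

In view of (11), this is equivalent to minimizing

$\sum_{a=1}^\infty \min_{f \in \mathcal{H}_X^B} \left[ \frac{1}{n} \sum_{i=1}^n \left| \left( \xi_a(Y_i) - \frac{1}{n} \sum_{j=1}^n \xi_a(Y_j) \right) - \left( f(X_i) - \frac{1}{n} \sum_{j=1}^n f(X_j) \right) \right|^2 + \varepsilon_n \|f\|_{\mathcal{H}_X^B}^2 \right]$

over $B \in \mathbb{S}^m_d(\mathbb{R})$, where $\{\xi_a\}_{a=1}^\infty$ is a complete orthonormal system for $\mathcal{H}_Y$.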

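The KDR contrast function in (16) can also be expressed in terms of
Gram matrices (given a kernel k, the Gram matrix is the $n \times n$ matrix
whose entries are the evaluations of the kernel on all pairs of n data points).
Let $\phi_i^B \in \mathcal{H}_X^B$ and $\psi_i \in \mathcal{H}_Y$ ($1 \le i \le n$) be functions defined by

$\phi_i^B = k_X^B(\cdot, X_i) - \frac{1}{n} \sum_{j=1}^n k_X^B(\cdot, X_j), \quad \psi_i = k_Y(\cdot, Y_i) - \frac{1}{n} \sum_{j=1}^n k_Y(\cdot, Y_j)$.

Because $\mathcal{R}(\hat{\Sigma}_{XX}^{B(n)}) = \mathcal{N}(\hat{\Sigma}_{XX}^{B(n)})^\perp$ and $\mathcal{R}(\hat{\Sigma}_{YY}^{(n)}) = \mathcal{N}(\hat{\Sigma}_{YY}^{(n)})^\perp$ are spanned by
$(\phi_i^B)_{i=1}^n$ and $(\psi_i)_{i=1}^n$, respectively, the trace of $\hat{\Sigma}_{YY|X}^{B(n)}$ is equal to that of
the matrix representation of $\hat{\Sigma}_{YY|X}^{B(n)}$ on the linear hull of $(\psi_i)_{i=1}^n$. Note that
although the vectors $(\psi_i)_{i=1}^n$ are over-complete, the trace of the matrix
representation with respect to these vectors is equal to the trace of the operator.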
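For $B \in \mathbb{S}^m_d(\mathbb{R})$, the centered Gram matrix $G_X^B$ with respect to the kernel $k_X^B$
is defined by

$(G_X^B)_{ij} = \langle \phi_i^B, \phi_j^B \rangle_{\mathcal{H}_X^B} = k_X^B(X_i, X_j) - \frac{1}{n} \sum_{b=1}^n k_X^B(X_i, X_b) - \frac{1}{n} \sum_{a=1}^n k_X^B(X_a, X_j) + \frac{1}{n^2} \sum_{a=1}^n \sum_{b=1}^n k_X^B(X_a, X_b)$,

and $G_Y$ is defined similarly. By direct calculation, it is easy to obtain

$\hat{\Sigma}_{YY|X}^{B(n)} \psi_i = \frac{1}{n} \sum_{j=1}^n \psi_j (G_Y)_{ji} - \frac{1}{n} \sum_{j=1}^n \psi_j (G_X^B (G_X^B + n \varepsilon_n I_n)^{-1} G_Y)_{ji}$.

It follows that the matrix representation of $\hat{\Sigma}_{YY|X}^{B(n)}$ with respect to $(\psi_i)_{i=1}^n$
is $\frac{1}{n} \{G_Y - G_X^B (G_X^B + n \varepsilon_n I_n)^{-1} G_Y\}$ and its trace is

$\mathrm{Tr}[\hat{\Sigma}_{YY|X}^{B(n)}] = \frac{1}{n} \mathrm{Tr}[G_Y - G_X^B (G_X^B + n \varepsilon_n I_n)^{-1} G_Y] = \varepsilon_n \mathrm{Tr}[G_Y (G_X^B + n \varepsilon_n I_n)^{-1}]$.

Omitting the constant factor, the KDR contrast function in (16) thus reduces
to

(17) $\mathrm{Tr}[G_Y (G_X^B + n \varepsilon_n I_n)^{-1}]$.

The KDR method is defined as the optimization of this function over the
manifold $\mathbb{S}^m_d(\mathbb{R})$.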
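
The contrast function (17) is straightforward to evaluate from data. The following sketch is a minimal illustration rather than the code used in our experiments: the function names, the toy regression model and the kernel widths are assumptions made for the example, and the linear system is solved directly instead of forming the matrix inverse.

```python
import numpy as np

def gaussian_gram(Z, sigma2):
    """Gram matrix of exp(-||z_i - z_j||^2 / sigma^2) over the rows of Z."""
    sq = np.sum(Z**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * Z @ Z.T, 0.0)
    return np.exp(-d2 / sigma2)

def centered_gram(K):
    """Double centering: remove row and column means, add back the grand mean."""
    return (K - K.mean(axis=0, keepdims=True)
              - K.mean(axis=1, keepdims=True) + K.mean())

def kdr_contrast(B, X, G_y, sigma2, eps):
    """Tr[G_Y (G_X^B + n eps I_n)^{-1}] with a Gaussian kernel on B^T x."""
    n = X.shape[0]
    G_x = centered_gram(gaussian_gram(X @ B, sigma2))
    return np.trace(np.linalg.solve(G_x + n * eps * np.eye(n), G_y))

# Toy data (hypothetical): Y depends on X only through its first coordinate.
rng = np.random.default_rng(0)
n, eps = 200, 0.1
X = rng.standard_normal((n, 4))
y = X[:, :1] + 0.1 * rng.standard_normal((n, 1))
G_y = centered_gram(gaussian_gram(y, 2.0))

e1 = np.array([[1.0], [0.0], [0.0], [0.0]])  # spans the true subspace
e2 = np.array([[0.0], [1.0], [0.0], [0.0]])  # an irrelevant direction
c_true = kdr_contrast(e1, X, G_y, sigma2=2.0, eps=eps)
c_wrong = kdr_contrast(e2, X, G_y, sigma2=2.0, eps=eps)

# Check of the trace identity that leads from (16) to (17).
G_x = centered_gram(gaussian_gram(X @ e1, 2.0))
solved = np.linalg.solve(G_x + n * eps * np.eye(n), G_y)
lhs = np.trace(G_y - G_x @ solved) / n
rhs = eps * np.trace(solved)
```

In this example the contrast is smaller at the direction that carries the dependence on Y than at the irrelevant one, and `lhs` and `rhs` agree up to rounding.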

Theorem 4 is the population justification of the KDR method. Note that 
this derivation imposes no strong assumptions either on the conditional 
probability of Y given X, or on the marginal distributions of X and Y . 
In particular, it does not require ellipticity of the marginal distribution of 
X, nor does it require an additive noise model. The response variable Y may 
be either continuous or discrete. We confirm this general applicability of the 
KDR method by the numerical results presented in the next section. 

Because the contrast function (17) is nonconvex, the minimization requires
a nonlinear optimization technique; in our experiments we use the
steepest descent method with line search. To alleviate potential problems
with local optima, we use a continuation method in which the scale parameter
$\sigma$ in the Gaussian RBF kernel $\exp(-\|x - y\|^2/\sigma^2)$ is gradually decreased
during the iterative optimization process. In numerical examples shown in
the next section, we used a fixed number of iterations, and decreased $\sigma^2$
linearly from $\sigma^2 = 100$ to $\sigma^2 = 10$ for standardized data with standard
deviation 5.0. Starting from a large $\sigma$ is natural, since the covariance operator
approaches the covariance operator induced by a linear kernel as $\sigma \to \infty$, and
that limiting problem is solvable as an eigenproblem.
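
The optimization scheme just described can be sketched as follows. This is an illustration under several assumptions of our own, not the procedure used in our experiments: the gradient is approximated by central finite differences, the iterate is mapped back to the Stiefel manifold by a QR factorization, the line search is a simple step halving, and the toy data, kernel width for Y and iteration count are hypothetical. Only the linear schedule for $\sigma^2$ from 100 to 10 and the value 0.1 of the regularization coefficient are taken from the text.

```python
import numpy as np

def gaussian_gram(Z, sigma2):
    sq = np.sum(Z**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * Z @ Z.T, 0.0)
    return np.exp(-d2 / sigma2)

def center(K):
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def contrast(B, X, G_y, sigma2, eps):
    """KDR contrast Tr[G_Y (G_X^B + n eps I_n)^{-1}] from (17)."""
    n = X.shape[0]
    G_x = center(gaussian_gram(X @ B, sigma2))
    return np.trace(np.linalg.solve(G_x + n * eps * np.eye(n), G_y))

def numerical_gradient(B, X, G_y, sigma2, eps, h=1e-5):
    """Central finite-difference gradient (an assumption; slow but simple)."""
    grad = np.zeros_like(B)
    for idx in np.ndindex(*B.shape):
        E = np.zeros_like(B)
        E[idx] = h
        grad[idx] = (contrast(B + E, X, G_y, sigma2, eps)
                     - contrast(B - E, X, G_y, sigma2, eps)) / (2 * h)
    return grad

def descent_step(B, X, G_y, sigma2, eps, max_halvings=20):
    """One steepest-descent step with step halving and a QR retraction."""
    f0 = contrast(B, X, G_y, sigma2, eps)
    G = numerical_gradient(B, X, G_y, sigma2, eps)
    # Project the gradient onto the tangent space of the manifold at B.
    G = G - B @ (B.T @ G + G.T @ B) / 2
    step = 1.0
    for _ in range(max_halvings):
        Q, _ = np.linalg.qr(B - step * G)  # back onto the Stiefel manifold
        f1 = contrast(Q, X, G_y, sigma2, eps)
        if f1 < f0:
            return Q, f0, f1
        step *= 0.5
    return B, f0, f0  # no improving step found; keep the current iterate

rng = np.random.default_rng(0)
n, m, d = 80, 3, 1
X = rng.standard_normal((n, m))
y = X[:, :1] + 0.1 * rng.standard_normal((n, 1))  # toy model, direction e_1
G_y = center(gaussian_gram(y, 2.0))

B = np.linalg.qr(np.ones((m, d)))[0]               # deterministic start
history = []
for sigma2 in np.linspace(100.0, 10.0, 30):        # continuation schedule
    B, f_before, f_after = descent_step(B, X, G_y, sigma2, eps=0.1)
    history.append((f_before, f_after))
```

By construction each step does not increase the contrast at the current value of $\sigma^2$, and the iterate stays on the manifold. In practice an analytic gradient would replace the finite differences, whose cost grows with the number of entries of B.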

In addition to $\sigma$, there is another tuning parameter $\varepsilon_n$, the regularization
coefficient. As both of these tuning parameters have a similar smoothing
effect, it is reasonable to fix one of them and select the other; in our
experiments we fixed $\varepsilon_n = 0.1$ as an arbitrary choice and varied $\sigma^2$. While there is
no theoretical guarantee for this choice, we observe the results are generally 
stable if the optimization process is successful. There also exist heuristics 
for choosing kernel parameters in similar RKHS-based dependency analysis; 
an example is to use the median of pairwise distances of the data for the 
parameter a in the Gaussian RBF kernel [16]. Currently, however, we are 
not aware of theoretically justified methods of choosing these parameters; 
this is an important open problem. 
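The median heuristic mentioned above can be computed directly from the data. The sketch below is one straightforward reading of it (the median taken over all distinct pairs, using Euclidean distance); the function name is illustrative.

```python
import numpy as np

def median_pairwise_distance(Z):
    """Median Euclidean distance over all distinct pairs of rows of Z."""
    sq = np.sum(Z**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)
    iu = np.triu_indices(Z.shape[0], k=1)  # indices of pairs i < j
    return np.median(np.sqrt(d2[iu]))
```

The returned value would then serve as a candidate for $\sigma$ in the kernel $\exp(-\|x - y\|^2/\sigma^2)$.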

The proposed estimator is shown to be consistent as the sample size goes 
to infinity. We defer the proof to Section 4. 
3. Numerical results.



3.1. Simulation studies. In this section we compare the performance of
the KDR method with that of several well-known dimension reduction methods.
Specifically, we compare to SIR, pHd and SAVE on synthetic data sets
generated by the regressions in Examples 6.2, 6.3 and 6.4 of [22]. The results
are evaluated by computing the Frobenius distance between the projection
matrix of the estimated subspace and that of the true subspace; this
evaluation measure is invariant under change of basis and is equal to

$$\|B_0 B_0^T - \hat{B}\hat{B}^T\|_F,$$

where $B_0$ and $\hat{B}$ are matrices in the Stiefel manifold
$\mathbb{S}_d^m(\mathbb{R})$ representing the true subspace and the estimated
subspace, respectively. For the KDR method, a Gaussian RBF kernel
$\exp(-\|z_1 - z_2\|^2/c)$ was used, with $c = 2.0$ for regression (A) and
regression (C) and $c = 0.5$ for regression (B). The parameter estimate
$\hat{B}$ was updated 100 times by the steepest descent method. The
regularization parameter was fixed at $\varepsilon = 0.1$. For SIR and
SAVE, we optimized the number of slices for each simulation so as to obtain
the best average norm.
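The evaluation measure, the Frobenius distance between the two projection matrices, can be computed in a few lines. The sketch below assumes both arguments have orthonormal columns; the function name is illustrative.

```python
import numpy as np

def projection_distance(B0, B):
    """Frobenius distance between the projectors B0 B0^T and B B^T.

    Assumes the columns of B0 and B are orthonormal.
    """
    return np.linalg.norm(B0 @ B0.T - B @ B.T, ord="fro")
```

Because $BB^T$ is unchanged when $B$ is replaced by $BR$ for an orthogonal $d \times d$ matrix $R$, the measure depends only on the subspaces and not on the chosen bases.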

Regression (A) is given by 

where $X \sim N(0, I_4)$ is a four-dimensional explanatory variable, and
$E \sim N(0, 1)$ is independent of $X$. Thus, the central subspace is spanned
by the vectors (1,0,0,0) and (0,1,0,0). For the noise level $\sigma$, three
different values were used: $\sigma = 0.1$, 0.4 and 0.8. We used 100 random
replications with 100 samples each. Note that the distribution of the
explanatory variable $X$ satisfies the ellipticity assumption, as required by
the SIR, SAVE and pHd methods.

Table 1 shows the mean and the standard deviation of the Frobenius norm
over 100 samples. We see that the KDR method outperforms the other three
methods in terms of estimation accuracy. It is also worth noting that in the
results presented by Li, Zha and Chiaromonte [22] for their GCR method,
the average norm was 0.28, 0.33, 0.45 for $\sigma = 0.1$, 0.4, 0.8,
respectively; again, this is worse than the performance of KDR.

The second regression is given by 

$$Y = \sin^2(\pi X_2 + 1) + \sigma E, \tag{B}$$

where $X \in \mathbb{R}^4$ is distributed uniformly on the set

$$[0, 1]^4 \setminus \{x \in \mathbb{R}^4 \mid x_i \le 0.7\ (i = 1, 2, 3, 4)\},$$
and $E \sim N(0, 1)$ is independent noise. The standard deviation $\sigma$ is
fixed at $\sigma = 0.1$, 0.2 and 0.3. Note that in this example the
distribution of $X$ does not satisfy the ellipticity assumption.

Table 1
Comparison of KDR and other methods for regression (A)

             KDR            SIR            SAVE           pHd
$\sigma$   NORM   SD      NORM   SD      NORM   SD      NORM   SD
0.1        0.11   0.07    0.55   0.28    0.77   0.35    1.04   0.34
0.4        0.17   0.09    0.60   0.27    0.82   0.34    1.03   0.33
0.8        0.34   0.22    0.69   0.25    0.94   0.35    1.06   0.33
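For concreteness, data for regression (B) could be generated as in the following sketch. It assumes that (B) reads $Y = \sin^2(\pi X_2 + 1) + \sigma E$ and that the excluded region is the cube of points whose coordinates are all at most 0.7; rejection sampling is one simple way to draw uniformly from the remaining set. The function name is illustrative.

```python
import numpy as np

def sample_regression_b(n, sigma, rng):
    """Draw n pairs (X, Y) for regression (B) by rejection sampling."""
    X = np.empty((0, 4))
    while X.shape[0] < n:
        cand = rng.uniform(0.0, 1.0, size=(2 * n, 4))
        # Keep a point unless all four coordinates are <= 0.7.
        keep = np.any(cand > 0.7, axis=1)
        X = np.vstack([X, cand[keep]])
    X = X[:n]
    E = rng.normal(size=n)
    # X_2 is the second coordinate, i.e. column index 1.
    Y = np.sin(np.pi * X[:, 1] + 1.0) ** 2 + sigma * E
    return X, Y
```

Each candidate is accepted with probability $1 - 0.7^4 \approx 0.76$, so the loop typically finishes in one pass.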

Table 2 shows the results of the simulation experiments for this regression.
We see that KDR again outperforms the other methods.

The third regression is given by 

$$Y = \tfrac{1}{2}(X_1 - a)^2 E, \tag{C}$$

where $X \sim N(0, I_{10})$ is a ten-dimensional variable and $E \sim N(0, 1)$
is independent noise. The parameter $a$ is fixed at $a = 0$, 0.5 and 1. Note
that in this example the conditional probability $p(y|x)$ does not obey an
additive noise assumption. The mean of $Y$ is zero and the variance is a
quadratic function of $X_1$. We generated 100 samples, each of 500 data points.

The results for KDR and the other methods are shown in Table 3, in
which we again confirm that the KDR method yields significantly better 
performance than the other methods. In this case, pHd fails to find the 
true subspace; this is due to the fact that pHd is incapable of estimating 
a direction that only appears in the variance [8]. We note also that the 
results in [22] show that the contour regression methods SCR and GCR 
yield average norms larger than 1.3. 

Although the estimation of variance structure is generally more difficult 
than that of estimating mean structure, the KDR method nonetheless is 
effective at finding the central subspace in this case. 



Table 2
Comparison of KDR and other methods for regression (B)

             KDR            SIR            SAVE           pHd
$\sigma$   NORM   SD      NORM   SD      NORM   SD      NORM   SD
0.1        0.05   0.02    0.24   0.10    0.23   0.13    0.43   0.19
0.2        0.11   0.06    0.32   0.15    0.29   0.16    0.51   0.23
0.3        0.13   0.07    0.41   0.19    0.41   0.21    0.63   0.29
Table 3
Comparison of KDR and other methods for regression (C)

             KDR            SIR            SAVE           pHd
$a$        NORM   SD      NORM   SD      NORM   SD      NORM   SD
0.0        0.17   0.05    1.83   0.22    0.30   0.07    1.48   0.27
0.5        0.17   0.04    0.58   0.19    0.35   0.08    1.52   0.28
1.0        0.18   0.05    0.30   0.08    0.57   0.20    1.58   0.28



3.2. Applications. We apply the KDR method to two data sets; one is a 
binary classification problem and the other is a regression with a continuous 
response variable. These data sets have been used previously in studies of 
dimension reduction methods. 

The first data set that we studied is the Swiss bank notes data set, which
has been previously studied in the dimension reduction context by Cook and
Lee [7], with the data taken from [11]. The problem is that of classifying
counterfeit and genuine Swiss bank notes. The data is a sample of 100
counterfeit and 100 genuine notes. There are six continuous explanatory
variables that represent aspects of the size of a note: length, height on the
left, height on the right, distance of inner frame to the lower border,
distance of inner frame to the upper border and length of the diagonal. We
standardize each of the explanatory variables so that their standard
deviation is 5.0.

As we have discussed in the Introduction, many dimension reduction 
methods (including SIR) are not generally suitable for binary classification 
problems. Because among inverse regression methods the estimated sub- 
space given by SAVE is necessarily larger than that given by pHd and SIR 
[7], we compared the KDR method only with SAVE for this data set. 

Figure 1 shows two-dimensional plots of the data projected onto the sub- 
spaces estimated by the KDR method and by SAVE. The figure shows that 
the results for KDR appear to be robust with respect to the values of the 
scale parameter a in the Gaussian RBF kernel. (Note that if a goes to infin- 
ity, the result approaches that obtained by a linear kernel, since the linear 
term in the Taylor expansion of the exponential function is dominant.) In 
the KDR case, using a Gaussian RBF with scale parameter a = 10 and 100 
we obtain clear separation of genuine and counterfeit notes. Slightly less 
separation is obtained for the Gaussian RBF kernel with a = 10,000, for the 
linear kernel and for SAVE; in these cases there is an isolated genuine data 
point that lies close to the class boundary, which is similar to the results 
using linear discriminant analysis and specification analysis [11]. We see that 
KDR finds a more effective subspace to separate the two classes than SAVE 
and the existing analysis. Finally, note that there are two clusters of
counterfeit notes in the result of SAVE, while KDR does not show multiple
clusters in either class. Although clusters have also been reported in other
analyses [11], Section 12, the KDR results suggest that the cluster structure
may not be relevant to the classification.

Fig. 1. Two-dimensional plots of Swiss bank notes. The crosses and circles show
genuine and counterfeit notes, respectively. For the KDR methods, the Gaussian
RBF kernel $\exp(-\|z_1 - z_2\|^2/a)$ is used with $a = 10$, 100 and 10,000.
For comparison, the plots given by KDR with a linear kernel and SAVE are shown.

We also analyzed the Evaporation data set, available in the Arc package
(http://www.stat.umn.edu/arc/software.html). The data set is concerned
with the effect on soil evaporation of various air and soil conditions.
The number of explanatory variables is 10: maximum daily soil temperature 
(Maxst), minimum daily soil temperature (Minst), area under the daily soil 
temperature curve (Avst), maximum daily air temperature (Maxat), mini- 
mum daily air temperature (Minat), average daily air temperature (Avat), 
maximum daily humidity (Maxh), minimum daily humidity (Minh), area 
under the daily humidity curve (Avh) and total wind speed in miles/hour 
(Wind). The response variable is daily soil evaporation (Evap). The data 
were collected daily during 46 days; thus, the number of data points is 
46. This data set was studied in the context of contour regression methods
for dimension reduction in [22]. We standardize each variable so that
the sample variance is equal to 5.0, and use the Gaussian RBF kernel
$\exp(-\|z_1 - z_2\|^2/10)$.

Our analysis yielded an estimated two-dimensional subspace which is 
spanned by the vectors: 

KDR1: -0.25 MAXST + 0.32 MINST + 0.00 AVST + (-0.28) MAXAT
      + (-0.23) MINAT + (-0.44) AVAT + 0.39 MAXH + 0.25 MINH
      + (-0.07) AVH + (-0.54) WIND.

KDR2: 0.0^ MAXST + (-0.02) MINST + 0.00 AVST + 0.10 MAXAT
      + (-0.45) MINAT + 0.23 AVAT + 0.21 MAXH + (-0.41) MINH
      + (-0.71) AVH + (-0.05) WIND.

In the first direction, Wind and Avat have large coefficients with the same
sign, while both contribute only weakly to the second direction. In the second
direction, Avh is dominant.

Figure 2 presents the scatter plots representing the response Y plotted 
with respect to each of the first two directions given by the KDR method. 
Both of these directions show a clear relation with Y . Figure 3 presents 
the scatter plot of Y versus the two-dimensional subspace found by KDR. 
The obtained two-dimensional subspace is different from the one given by 
the existing analysis in [22] ; the contour regression method gives a subspace 
in which the first direction shows a clear monotonic trend, but the second 
direction suggests a U-shaped pattern. In the result of KDR, we do not see
a clear folded pattern. Although without further analysis it is difficult to 
say which result expresses more clearly the statistical dependence, the plots 
suggest that the KDR method successfully captured the effective directions 
for regression. 
Fig. 2. Two-dimensional representation of Evaporation data for each of the
first two directions.

4. Consistency of kernel dimension reduction. In this section we prove
that the KDR estimator is consistent. Our proof of consistency requires tools
from empirical process theory, suitably elaborated to handle the RKHS
setting. We establish convergence of the empirical contrast function to the
population contrast function under a condition on the regularization
coefficient $\varepsilon_n$, and from this result infer the consistency of
$\hat{B}^{(n)}$.

4.1. Main result. We assume hereafter that $\mathcal{Y}$ is a topological
space. The Stiefel manifold $\mathbb{S}_d^m(\mathbb{R})$ is assumed to be
equipped with a distance $D$ which is compatible with the topology of
$\mathbb{S}_d^m(\mathbb{R})$. It is known that geodesics define such a
distance (see, e.g., [19], Chapter IV).




Fig. 3. Three-dimensional representation of Evaporation data. 
The following technical assumptions are needed to guarantee the consistency
of kernel dimension reduction:

(A-1) For any bounded continuous function $g$ on $\mathcal{Y}$, the function

$$B \mapsto E_X\bigl[E_{Y|B^TX}[g(Y) \mid B^TX]^2\bigr]$$

is continuous on $\mathbb{S}_d^m(\mathbb{R})$.

(A-2) For $B \in \mathbb{S}_d^m(\mathbb{R})$, let $P_B$ be the probability
distribution of the random variable $BB^TX$ on $\mathcal{X}$. The Hilbert space
$\mathcal{H}_X^B + \mathbb{R}$ is dense in $L^2(P_B)$ for any
$B \in \mathbb{S}_d^m(\mathbb{R})$.

(A-3) There exists a measurable function $\phi : \mathcal{X} \to \mathbb{R}$
such that $E|\phi(X)|^2 < \infty$ and the Lipschitz condition

$$\|k_d(B^Tx, \cdot) - k_d(\tilde{B}^Tx, \cdot)\|_{\mathcal{H}_d} \le \phi(x) D(B, \tilde{B})$$

holds for all $B, \tilde{B} \in \mathbb{S}_d^m(\mathbb{R})$ and $x \in \mathcal{X}$.

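Theorem 6. Suppose $k_d$ in (9) is continuous and bounded, and suppose
the regularization parameter $\varepsilon_n$ in (15) satisfies

$$\varepsilon_n \to 0, \qquad n^{1/2}\varepsilon_n \to \infty \qquad (n \to \infty). \tag{18}$$

Define the set of the optimum parameters by

$$\mathbb{B}_d^m = \arg\min_{B \in \mathbb{S}_d^m(\mathbb{R})} \mathrm{Tr}[\Sigma_{YY|X}^B].$$

Under the assumptions (A-1), (A-2) and (A-3), the set $\mathbb{B}_d^m$ is
nonempty, and for an arbitrary open set $U$ in $\mathbb{S}_d^m(\mathbb{R})$
with $\mathbb{B}_d^m \subset U$ we have

$$\lim_{n \to \infty} \Pr(\hat{B}^{(n)} \in U) = 1.$$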

Note that Theorem 6 holds independently of any requirement that the 
population contrast function characterizes conditional independence. If the 
additional conditions of Theorem 4 are satisfied, then the estimator con- 
verges in probability to the set of sufficient dimension-reduction subspaces. 

The assumptions (A-1) and (A-2) are used to establish the continuity of
$\mathrm{Tr}[\Sigma_{YY|X}^B]$ in Lemma 13, and (A-3) is needed to derive the
order of uniform convergence of $\hat{\Sigma}_{YY|X}^{B(n)}$ in Lemma 9.

The assumption (A-1) is satisfied in various cases. Let
$f(x) = E_{Y|X}[g(Y) \mid X = x]$, and assume $f(x)$ is continuous. This
assumption holds, for example, if the conditional probability density
$p_{Y|X}(y|x)$ is bounded and continuous with respect to $x$. Let $C$ be an
element of $\mathbb{S}_{m-d}^m(\mathbb{R})$ such that the subspaces spanned
by the column vectors of $B$ and $C$ are orthogonal; that is, the $m \times m$
matrix $(B, C)$ is an orthogonal matrix. Define random variables $U$ and $V$
by $U = B^TX$ and $V = C^TX$. If $X$ has the probability density function
$p_X(x)$, the probability density function of $(U, V)$ is given by
$p_{U,V}(u, v) = p_X(Bu + Cv)$. Consider the situation in which $u$ is given by
$u = B^T\tilde{x}$ for $B \in \mathbb{S}_d^m(\mathbb{R})$ and
$\tilde{x} \in \mathcal{X}$, and let
$\mathcal{V}_{B,\tilde{x}} = \{v \in \mathbb{R}^{m-d} \mid BB^T\tilde{x} + Cv \in \mathcal{X}\}$.
We have

$$E[g(Y) \mid B^TX = B^T\tilde{x}] = \frac{\int_{\mathcal{V}_{B,\tilde{x}}} f(BB^T\tilde{x} + Cv)\, p_X(BB^T\tilde{x} + Cv)\, dv}{\int_{\mathcal{V}_{B,\tilde{x}}} p_X(BB^T\tilde{x} + Cv)\, dv}.$$

If there exists an integrable function $r(v)$ such that
$\chi_{\mathcal{V}_{B,\tilde{x}}}(v)\, p_X(BB^T\tilde{x} + Cv) \le r(v)$ for all
$B \in \mathbb{S}_d^m(\mathbb{R})$ and $\tilde{x} \in \mathcal{X}$, the
dominated convergence theorem ensures (A-1). Thus, it is easy to see that a
sufficient condition for (A-1) is that $\mathcal{X}$ is bounded, $p_X(x)$ is
bounded, and $p_{Y|X}(y|x)$ is bounded and continuous on $x$, which is
satisfied by a wide class of distributions.

The assumption (A-2) holds if $\mathcal{X}$ is compact and $k_d + 1$ is a
universal kernel on $\mathcal{Z}$. The assumption (A-3) is satisfied by many
useful kernels; for example, kernels with the property

$$\left|\frac{\partial^2}{\partial z_a\, \partial z_b} k_d(z_1, z_2)\right| \le L\|z_1 - z_2\| \qquad (a, b = 1, 2)$$

for some $L > 0$. In particular, Gaussian RBF kernels satisfy this property.

4.2. Proof of the consistency theorem. If the following proposition is 
shown, Theorem 6 follows straightforwardly by standard arguments estab- 
lishing the consistency of M-estimators (see, e.g., [31], Section 5.2). 

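Proposition 7. Under the same assumptions as Theorem 6, the functions
$\mathrm{Tr}[\hat{\Sigma}_{YY|X}^{B(n)}]$ and $\mathrm{Tr}[\Sigma_{YY|X}^B]$ are
continuous on $\mathbb{S}_d^m(\mathbb{R})$, and

$$\sup_{B \in \mathbb{S}_d^m(\mathbb{R})} \bigl|\mathrm{Tr}[\hat{\Sigma}_{YY|X}^{B(n)}] - \mathrm{Tr}[\Sigma_{YY|X}^B]\bigr| \to 0 \qquad (n \to \infty)$$

in probability.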



The proof of Proposition 7 is divided into several lemmas. We decompose
$\sup_B |\mathrm{Tr}[\Sigma_{YY|X}^B] - \mathrm{Tr}[\hat{\Sigma}_{YY|X}^{B(n)}]|$
into two parts:
$\sup_B |\mathrm{Tr}[\Sigma_{YY|X}^B] - \mathrm{Tr}[\Sigma_{YY} - \Sigma_{YX}^B(\Sigma_{XX}^B + \varepsilon_n I)^{-1}\Sigma_{XY}^B]|$
and
$\sup_B |\mathrm{Tr}[\Sigma_{YY} - \Sigma_{YX}^B(\Sigma_{XX}^B + \varepsilon_n I)^{-1}\Sigma_{XY}^B] - \mathrm{Tr}[\hat{\Sigma}_{YY|X}^{B(n)}]|$.
Lemmas 8, 9 and 10 establish the convergence of the second part.
The convergence of the first part is shown by Lemmas 11-14; in particular,
Lemmas 12 and 13 establish the key result that the trace of the population
conditional covariance operator is a continuous function of $B$.

The following lemmas make use of the trace norm and the Hilbert-Schmidt
norm of operators. For a discussion of these norms, see [26], Section
VI and [20], Section 30. Recall that the trace of a positive operator $A$ on a
Hilbert space $\mathcal{H}$ is defined by

$$\mathrm{Tr}[A] = \sum_{i=1}^{\infty} \langle \varphi_i, A\varphi_i \rangle_{\mathcal{H}},$$

where $\{\varphi_i\}_{i=1}^{\infty}$ is a complete orthonormal system (CONS) of
$\mathcal{H}$. A bounded operator $T$ on a Hilbert space $\mathcal{H}$ is called
trace class if $\mathrm{Tr}[(T^*T)^{1/2}]$ is finite. The set of all trace class
operators on a Hilbert space is a Banach space with the trace norm
$\|T\|_{\mathrm{tr}} = \mathrm{Tr}[(T^*T)^{1/2}]$. For a trace class operator $T$
on $\mathcal{H}$, the series $\sum_{i=1}^{\infty} \langle \psi_i, T\psi_i \rangle$
converges absolutely for any CONS $\{\psi_i\}_{i=1}^{\infty}$, and the
limit does not depend on the choice of CONS. The limit is called the trace
of $T$, and denoted by $\mathrm{Tr}[T]$. It is known that
$|\mathrm{Tr}[T]| \le \|T\|_{\mathrm{tr}}$.

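A bounded operator $T : \mathcal{H}_1 \to \mathcal{H}_2$, where $\mathcal{H}_1$
and $\mathcal{H}_2$ are Hilbert spaces, is called Hilbert-Schmidt if
$\mathrm{Tr}[T^*T] < \infty$, or equivalently,
$\sum_{i=1}^{\infty} \|T\varphi_i\|_{\mathcal{H}_2}^2 < \infty$ for a CONS
$\{\varphi_i\}_{i=1}^{\infty}$ of $\mathcal{H}_1$. The set of all
Hilbert-Schmidt operators from $\mathcal{H}_1$ to $\mathcal{H}_2$ is a Hilbert
space with Hilbert-Schmidt inner product

$$\langle T_1, T_2 \rangle_{\mathrm{HS}} = \sum_{i=1}^{\infty} \langle T_1\varphi_i, T_2\varphi_i \rangle_{\mathcal{H}_2},$$

where $\{\varphi_i\}_{i=1}^{\infty}$ is a CONS of $\mathcal{H}_1$. Thus, the
Hilbert-Schmidt norm $\|T\|_{\mathrm{HS}}$ satisfies
$\|T\|_{\mathrm{HS}}^2 = \sum_{i=1}^{\infty} \|T\varphi_i\|_{\mathcal{H}_2}^2$.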

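Obviously, $\|T\| \le \|T\|_{\mathrm{HS}} \le \|T\|_{\mathrm{tr}}$ holds, if $T$
is trace class or Hilbert-Schmidt. Recall also that if $A$ is trace class
(Hilbert-Schmidt) and $B$ is bounded, $AB$ and $BA$ are trace class
(Hilbert-Schmidt, resp.), for which
$\|BA\|_{\mathrm{tr}} \le \|B\|\|A\|_{\mathrm{tr}}$ and
$\|AB\|_{\mathrm{tr}} \le \|B\|\|A\|_{\mathrm{tr}}$
($\|AB\|_{\mathrm{HS}} \le \|A\|_{\mathrm{HS}}\|B\|$ and
$\|BA\|_{\mathrm{HS}} \le \|B\|\|A\|_{\mathrm{HS}}$). If
$A : \mathcal{H}_1 \to \mathcal{H}_2$ and $B : \mathcal{H}_2 \to \mathcal{H}_1$
are Hilbert-Schmidt, the product $AB$ is trace class with
$\|AB\|_{\mathrm{tr}} \le \|A\|_{\mathrm{HS}}\|B\|_{\mathrm{HS}}$.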
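In finite dimensions these norms reduce to functions of the singular values, which gives a quick way to check the inequalities numerically. The sketch below is only a finite-dimensional illustration with matrices standing in for operators; the function name is illustrative.

```python
import numpy as np

def operator_norms(T):
    """Return (operator norm, Hilbert-Schmidt norm, trace norm) of a matrix.

    With singular values s_i: ||T|| = max s_i, ||T||_HS = sqrt(sum s_i^2),
    ||T||_tr = sum s_i.
    """
    s = np.linalg.svd(T, compute_uv=False)
    return s.max(), np.sqrt(np.sum(s**2)), s.sum()
```

For matrices the Hilbert-Schmidt norm coincides with the Frobenius norm and the trace norm with the nuclear norm.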

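It is known that cross-covariance operators and covariance operators are
Hilbert-Schmidt and trace class, respectively, under the assumption (2) [13,
16]. The Hilbert-Schmidt norm of $\Sigma_{YX}$ is given by

$$\|\Sigma_{YX}\|_{\mathrm{HS}}^2 = \bigl\|E_{YX}[(k_{\mathcal{X}}(\cdot, X) - m_X)(k_{\mathcal{Y}}(\cdot, Y) - m_Y)]\bigr\|_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{Y}}}^2, \tag{19}$$

where $\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{Y}}$ is the direct
product of $\mathcal{H}_{\mathcal{X}}$ and $\mathcal{H}_{\mathcal{Y}}$, and the
trace norm of $\Sigma_{XX}$ is

$$\mathrm{Tr}[\Sigma_{XX}] = E_X\bigl[\|k_{\mathcal{X}}(\cdot, X) - m_X\|_{\mathcal{H}_{\mathcal{X}}}^2\bigr]. \tag{20}$$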
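Lemma 8.

$$\bigl|\mathrm{Tr}[\hat{\Sigma}_{YY|X}^{(n)}] - \mathrm{Tr}[\Sigma_{YY} - \Sigma_{YX}(\Sigma_{XX} + \varepsilon_n I)^{-1}\Sigma_{XY}]\bigr|$$
$$\le \frac{1}{\varepsilon_n}\bigl\{(\|\hat{\Sigma}_{YX}^{(n)}\|_{\mathrm{HS}} + \|\Sigma_{YX}\|_{\mathrm{HS}})\|\hat{\Sigma}_{YX}^{(n)} - \Sigma_{YX}\|_{\mathrm{HS}} + \|\Sigma_{YY}\|_{\mathrm{tr}}\|\hat{\Sigma}_{XX}^{(n)} - \Sigma_{XX}\|\bigr\} + \bigl|\mathrm{Tr}[\hat{\Sigma}_{YY}^{(n)} - \Sigma_{YY}]\bigr|.$$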

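Proof. Noting that the self-adjoint operator
$\Sigma_{YX}(\Sigma_{XX} + \varepsilon_n I)^{-1}\Sigma_{XY}$ is trace class from
$\Sigma_{YX}(\Sigma_{XX} + \varepsilon_n I)^{-1}\Sigma_{XY} \le \Sigma_{YY}$, the
left-hand side of the assertion is bounded from above by

$$\bigl|\mathrm{Tr}[\hat{\Sigma}_{YY}^{(n)} - \Sigma_{YY}]\bigr| + \bigl|\mathrm{Tr}[\hat{\Sigma}_{YX}^{(n)}(\hat{\Sigma}_{XX}^{(n)} + \varepsilon_n I)^{-1}\hat{\Sigma}_{XY}^{(n)} - \Sigma_{YX}(\Sigma_{XX} + \varepsilon_n I)^{-1}\Sigma_{XY}]\bigr|.$$

The second term is upper-bounded by

$$\bigl|\mathrm{Tr}[(\hat{\Sigma}_{YX}^{(n)} - \Sigma_{YX})(\hat{\Sigma}_{XX}^{(n)} + \varepsilon_n I)^{-1}\hat{\Sigma}_{XY}^{(n)}]\bigr|$$
$$+ \bigl|\mathrm{Tr}[\Sigma_{YX}(\hat{\Sigma}_{XX}^{(n)} + \varepsilon_n I)^{-1}(\hat{\Sigma}_{XY}^{(n)} - \Sigma_{XY})]\bigr|$$
$$+ \bigl|\mathrm{Tr}[\Sigma_{YX}\{(\hat{\Sigma}_{XX}^{(n)} + \varepsilon_n I)^{-1} - (\Sigma_{XX} + \varepsilon_n I)^{-1}\}\Sigma_{XY}]\bigr|$$
$$\le \bigl\|(\hat{\Sigma}_{YX}^{(n)} - \Sigma_{YX})(\hat{\Sigma}_{XX}^{(n)} + \varepsilon_n I)^{-1}\hat{\Sigma}_{XY}^{(n)}\bigr\|_{\mathrm{tr}} + \bigl\|\Sigma_{YX}(\hat{\Sigma}_{XX}^{(n)} + \varepsilon_n I)^{-1}(\hat{\Sigma}_{XY}^{(n)} - \Sigma_{XY})\bigr\|_{\mathrm{tr}}$$
$$+ \bigl|\mathrm{Tr}[\Sigma_{YX}(\Sigma_{XX} + \varepsilon_n I)^{-1/2}\{(\Sigma_{XX} + \varepsilon_n I)^{1/2}(\hat{\Sigma}_{XX}^{(n)} + \varepsilon_n I)^{-1}(\Sigma_{XX} + \varepsilon_n I)^{1/2} - I\}(\Sigma_{XX} + \varepsilon_n I)^{-1/2}\Sigma_{XY}]\bigr|$$
$$\le \frac{1}{\varepsilon_n}\|\hat{\Sigma}_{YX}^{(n)} - \Sigma_{YX}\|_{\mathrm{HS}}\|\hat{\Sigma}_{XY}^{(n)}\|_{\mathrm{HS}} + \frac{1}{\varepsilon_n}\|\Sigma_{YX}\|_{\mathrm{HS}}\|\hat{\Sigma}_{XY}^{(n)} - \Sigma_{XY}\|_{\mathrm{HS}}$$
$$+ \bigl\|(\Sigma_{XX} + \varepsilon_n I)^{1/2}(\hat{\Sigma}_{XX}^{(n)} + \varepsilon_n I)^{-1}(\Sigma_{XX} + \varepsilon_n I)^{1/2} - I\bigr\| \times \bigl\|(\Sigma_{XX} + \varepsilon_n I)^{-1/2}\Sigma_{XY}\Sigma_{YX}(\Sigma_{XX} + \varepsilon_n I)^{-1/2}\bigr\|_{\mathrm{tr}}.$$

In the last line, we use $|\mathrm{Tr}[ABA^*]| \le \|B\|\|A^*A\|_{\mathrm{tr}}$ for
a Hilbert-Schmidt operator $A$ and a bounded operator $B$. This is confirmed
easily by the singular decomposition of $A$.

Since the spectra of $A^*A$ and $AA^*$ are identical, we have

$$\bigl\|(\Sigma_{XX} + \varepsilon_n I)^{1/2}(\hat{\Sigma}_{XX}^{(n)} + \varepsilon_n I)^{-1}(\Sigma_{XX} + \varepsilon_n I)^{1/2} - I\bigr\|$$
$$= \bigl\|(\hat{\Sigma}_{XX}^{(n)} + \varepsilon_n I)^{-1/2}(\Sigma_{XX} + \varepsilon_n I)(\hat{\Sigma}_{XX}^{(n)} + \varepsilon_n I)^{-1/2} - I\bigr\|$$
$$\le \bigl\|(\hat{\Sigma}_{XX}^{(n)} + \varepsilon_n I)^{-1/2}(\Sigma_{XX} - \hat{\Sigma}_{XX}^{(n)})(\hat{\Sigma}_{XX}^{(n)} + \varepsilon_n I)^{-1/2}\bigr\| \le \frac{1}{\varepsilon_n}\|\hat{\Sigma}_{XX}^{(n)} - \Sigma_{XX}\|.$$

The bound $\|(\Sigma_{XX} + \varepsilon_n I)^{-1/2}\Sigma_{XX}^{1/2}V_{XY}\| \le 1$
yields

$$\bigl\|(\Sigma_{XX} + \varepsilon_n I)^{-1/2}\Sigma_{XY}\Sigma_{YX}(\Sigma_{XX} + \varepsilon_n I)^{-1/2}\bigr\|_{\mathrm{tr}} \le \|\Sigma_{YY}\|_{\mathrm{tr}},$$

which concludes the proof. □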

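Lemma 9. Under the assumption (A-3),

$$\sup_{B \in \mathbb{S}_d^m(\mathbb{R})} \|\hat{\Sigma}_{XX}^{B(n)} - \Sigma_{XX}^B\|_{\mathrm{HS}}, \qquad \sup_{B \in \mathbb{S}_d^m(\mathbb{R})} \|\hat{\Sigma}_{XY}^{B(n)} - \Sigma_{XY}^B\|_{\mathrm{HS}}$$

and

$$\sup_{B \in \mathbb{S}_d^m(\mathbb{R})} \bigl|\mathrm{Tr}[\hat{\Sigma}_{YY}^{(n)} - \Sigma_{YY}]\bigr|$$

are of order $O_p(1/\sqrt{n})$ as $n \to \infty$.

The proof of Lemma 9 is deferred to the Appendix. From Lemmas 8 and
9, the following lemma is obvious.

Lemma 10. If the regularization parameter $(\varepsilon_n)_{n=1}^{\infty}$
satisfies (18), under the assumption (A-3) we have

$$\sup_{B \in \mathbb{S}_d^m(\mathbb{R})} \bigl|\mathrm{Tr}[\hat{\Sigma}_{YY|X}^{B(n)}] - \mathrm{Tr}[\Sigma_{YY} - \Sigma_{YX}^B(\Sigma_{XX}^B + \varepsilon_n I)^{-1}\Sigma_{XY}^B]\bigr| = O_p(\varepsilon_n^{-1} n^{-1/2})$$

as $n \to \infty$.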

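In the next four lemmas, we establish the uniform convergence of
$L_\varepsilon$ to $L_0$ ($\varepsilon \downarrow 0$), where $L_\varepsilon(B)$
is a function on $\mathbb{S}_d^m(\mathbb{R})$ defined by

$$L_\varepsilon(B) = \mathrm{Tr}[\Sigma_{YX}^B(\Sigma_{XX}^B + \varepsilon I)^{-1}\Sigma_{XY}^B]$$

for $\varepsilon > 0$ and
$L_0(B) = \mathrm{Tr}[\Sigma_{YY}^{1/2} V_{YX}^B V_{XY}^B \Sigma_{YY}^{1/2}]$. We
begin by establishing pointwise convergence.

Lemma 11. For arbitrary kernels with (2),

$$\mathrm{Tr}[\Sigma_{YX}(\Sigma_{XX} + \varepsilon I)^{-1}\Sigma_{XY}] \to \mathrm{Tr}[\Sigma_{YY}^{1/2} V_{YX} V_{XY} \Sigma_{YY}^{1/2}] \qquad (\varepsilon \downarrow 0).$$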

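Proof. With a CONS $\{\psi_i\}_{i=1}^{\infty}$ for $\mathcal{H}_{\mathcal{Y}}$,
the difference of the right-hand side and the left-hand side can be written as

$$\sum_{i=1}^{\infty} \bigl\langle \psi_i, \Sigma_{YY}^{1/2} V_{YX}\{I - \Sigma_{XX}^{1/2}(\Sigma_{XX} + \varepsilon I)^{-1}\Sigma_{XX}^{1/2}\} V_{XY} \Sigma_{YY}^{1/2} \psi_i \bigr\rangle_{\mathcal{H}_{\mathcal{Y}}}.$$

Since each summand is positive and upper bounded by
$\langle \psi_i, \Sigma_{YY}^{1/2} V_{YX} V_{XY} \Sigma_{YY}^{1/2} \psi_i \rangle_{\mathcal{H}_{\mathcal{Y}}}$,
and the sum over $i$ is finite, by the dominated convergence theorem
it suffices to show

$$\lim_{\varepsilon \downarrow 0} \bigl\langle \psi, \Sigma_{YY}^{1/2} V_{YX}\{I - \Sigma_{XX}^{1/2}(\Sigma_{XX} + \varepsilon I)^{-1}\Sigma_{XX}^{1/2}\} V_{XY} \Sigma_{YY}^{1/2} \psi \bigr\rangle_{\mathcal{H}_{\mathcal{Y}}} = 0$$

for each $\psi \in \mathcal{H}_{\mathcal{Y}}$.

Fix arbitrary $\psi \in \mathcal{H}_{\mathcal{Y}}$ and $\delta > 0$. From the fact
$\mathcal{R}(V_{XY}) \subset \overline{\mathcal{R}(\Sigma_{XX})}$, there exists
$h \in \mathcal{H}_{\mathcal{X}}$ such that
$\|V_{XY}\Sigma_{YY}^{1/2}\psi - \Sigma_{XX}h\|_{\mathcal{H}_{\mathcal{X}}} < \delta$.
Using the fact
$I - \Sigma_{XX}^{1/2}(\Sigma_{XX} + \varepsilon I)^{-1}\Sigma_{XX}^{1/2} = \varepsilon(\Sigma_{XX} + \varepsilon I)^{-1}$,
we have

$$\bigl\|\{I - \Sigma_{XX}^{1/2}(\Sigma_{XX} + \varepsilon I)^{-1}\Sigma_{XX}^{1/2}\} V_{XY}\Sigma_{YY}^{1/2}\psi\bigr\|_{\mathcal{H}_{\mathcal{X}}}$$
$$\le \|\varepsilon(\Sigma_{XX} + \varepsilon I)^{-1}\Sigma_{XX}h\|_{\mathcal{H}_{\mathcal{X}}} + \|\varepsilon(\Sigma_{XX} + \varepsilon I)^{-1}(V_{XY}\Sigma_{YY}^{1/2}\psi - \Sigma_{XX}h)\|_{\mathcal{H}_{\mathcal{X}}}$$
$$\le \varepsilon\|h\|_{\mathcal{H}_{\mathcal{X}}} + \delta,$$

which is arbitrarily small if $\varepsilon$ is sufficiently small. This
completes the proof. □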
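A finite-dimensional analogue may help to see what Lemma 11 asserts. In the sketch below, sample covariance matrices stand in for the covariance operators (an assumption made purely for illustration), and the code checks both the operator identity used in the proof and the monotone behavior of the regularized trace as $\varepsilon \downarrow 0$. The function names are illustrative.

```python
import numpy as np

def regularized_trace(S_xx, S_xy, eps):
    """Tr[S_yx (S_xx + eps I)^{-1} S_xy] for finite-dimensional blocks."""
    m = S_xx.shape[0]
    return np.trace(S_xy.T @ np.linalg.solve(S_xx + eps * np.eye(m), S_xy))

def psd_sqrt(S):
    """Symmetric square root of a positive semidefinite matrix."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = X @ rng.normal(size=(3, 2)) + 0.5 * rng.normal(size=(200, 2))
S = np.cov(np.hstack([X, Y]), rowvar=False)
S_xx, S_xy = S[:3, :3], S[:3, 3:]

# The regularized trace increases as eps decreases toward zero.
traces = [regularized_trace(S_xx, S_xy, e) for e in (1.0, 0.1, 0.01, 1e-4)]
```

In this matrix setting the limit is $\mathrm{Tr}[S_{YX} S_{XX}^{-1} S_{XY}]$, since $S_{XX}$ is invertible; the operator statement in the lemma is needed precisely because $\Sigma_{XX}^{-1}$ is not bounded in general.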

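Lemma 12. Suppose $k_d$ is continuous and bounded. Then, for any
$\varepsilon > 0$, the function $L_\varepsilon(B)$ is continuous on
$\mathbb{S}_d^m(\mathbb{R})$.

Proof. By an argument similar to that in the proof of Lemma 11, it
suffices to show the continuity of
$B \mapsto \langle \psi, \Sigma_{YX}^B(\Sigma_{XX}^B + \varepsilon I)^{-1}\Sigma_{XY}^B \psi \rangle_{\mathcal{H}_{\mathcal{Y}}}$
for each $\psi \in \mathcal{H}_{\mathcal{Y}}$.

Let $J_X^B : \mathcal{H}_X^B \to L^2(P_X)$ and
$J_Y : \mathcal{H}_{\mathcal{Y}} \to L^2(P_Y)$ be inclusions. As seen in
Proposition 1, the operators $\Sigma_{YX}^B$ and $\Sigma_{XX}^B$ can be extended
to the integral operators $S_{YX}^B$ and $S_{XX}^B$ on $L^2(P_X)$, respectively,
so that $J_Y \Sigma_{YX}^B = S_{YX}^B J_X^B$ and
$J_X^B \Sigma_{XX}^B = S_{XX}^B J_X^B$. It is not difficult to see also
$J_X^B(\Sigma_{XX}^B + \varepsilon I)^{-1} = (S_{XX}^B + \varepsilon I)^{-1} J_X^B$
for $\varepsilon > 0$. These relations yield

$$\langle \psi, \Sigma_{YX}^B(\Sigma_{XX}^B + \varepsilon I)^{-1}\Sigma_{XY}^B \psi \rangle_{\mathcal{H}_{\mathcal{Y}}} = E_{XY}\bigl[\psi(Y)\bigl((S_{XX}^B + \varepsilon I)^{-1} S_{XY}^B \psi\bigr)(X)\bigr]$$
$$- E_Y[\psi(Y)]\, E_X\bigl[\bigl((S_{XX}^B + \varepsilon I)^{-1} S_{XY}^B \psi\bigr)(X)\bigr],$$

where $J_Y\psi$ is identified with $\psi$. The assertion is obtained if we prove
that the operators $S_{XY}^B$ and $(S_{XX}^B + \varepsilon I)^{-1}$ are
continuous with respect to $B$ in operator norm. To see this, let $\tilde{X}$ be
identically and independently distributed with $X$. We have

$$\|(S_{XY}^B - S_{XY}^{B_0})\psi\|_{L^2(P_X)}^2 = E_{\tilde{X}}\bigl[\mathrm{Cov}_{YX}[k_X^B(X, \tilde{X}) - k_X^{B_0}(X, \tilde{X}), \psi(Y)]^2\bigr]$$
$$\le E_{\tilde{X}}\bigl[\mathrm{Var}_X[k_d(B^TX, B^T\tilde{X}) - k_d(B_0^TX, B_0^T\tilde{X})]\, \mathrm{Var}_Y[\psi(Y)]\bigr],$$

from which the continuity of $B \mapsto S_{XY}^B$ is obtained by the continuity
and boundedness of $k_d$. The continuity of $(S_{XX}^B + \varepsilon I)^{-1}$ is
shown by

$$\|(S_{XX}^B + \varepsilon I)^{-1} - (S_{XX}^{B_0} + \varepsilon I)^{-1}\| = \|(S_{XX}^B + \varepsilon I)^{-1}(S_{XX}^{B_0} - S_{XX}^B)(S_{XX}^{B_0} + \varepsilon I)^{-1}\| \le \frac{1}{\varepsilon^2}\|S_{XX}^B - S_{XX}^{B_0}\|.$$ □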

To establish the continuity of
$L_0(B) = \mathrm{Tr}[\Sigma_{YX}^B (\Sigma_{XX}^B)^{-1} \Sigma_{XY}^B]$, the
argument in the proof of Lemma 12 cannot be applied, because
$(\Sigma_{XX}^B)^{-1}$ is not bounded in general. The assumptions (A-1) and
(A-2) are used for the proof.

Lemma 13. Suppose $k_d$ is continuous and bounded. Under the assumptions
(A-1) and (A-2), the function $L_0(B)$ is continuous on
$\mathbb{S}_d^m(\mathbb{R})$.

Proof. By the same argument as in the proof of Lemma 11, it suffices to
establish the continuity of
$B \mapsto \langle \psi, \Sigma_{YY|X}^B \psi \rangle$ for $\psi \in \mathcal{H}_{\mathcal{Y}}$.
From Proposition 2, the proof is completed if the continuity of the map

$$B \mapsto \inf_{f \in \mathcal{H}_X^B} \mathrm{Var}_{XY}[g(Y) - f(X)]$$

is proved for any continuous and bounded function $g$.

Since $f(x)$ depends only on $B^Tx$ for any $f \in \mathcal{H}_X^B$, under the
assumption (A-2), we use the same argument as in the proof of Proposition 3 to
obtain

$$\inf_{f \in \mathcal{H}_X^B} \mathrm{Var}_{XY}[g(Y) - f(X)]$$
$$= \inf_{f \in \mathcal{H}_X^B} \mathrm{Var}_X\bigl[E_{Y|BB^TX}[g(Y) \mid BB^TX] - f(X)\bigr] + E_X\bigl[\mathrm{Var}_{Y|BB^TX}[g(Y) \mid BB^TX]\bigr]$$
$$= E_Y[g(Y)^2] - E_X\bigl[E_{Y|B^TX}[g(Y) \mid B^TX]^2\bigr],$$

which is a continuous function of $B \in \mathbb{S}_d^m(\mathbb{R})$ from
assumption (A-1). □

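Lemma 14. Suppose that $k_d$ is continuous and bounded, and that
$\varepsilon_n$ converges to zero as $n$ goes to infinity. Under the
assumptions (A-1) and (A-2), we have

$$\sup_{B \in \mathbb{S}_d^m(\mathbb{R})} \mathrm{Tr}\bigl[\Sigma_{YY|X}^B - \{\Sigma_{YY} - \Sigma_{YX}^B(\Sigma_{XX}^B + \varepsilon_n I)^{-1}\Sigma_{XY}^B\}\bigr] \to 0 \qquad (n \to \infty).$$

Proof. From Lemmas 11, 12 and 13, the continuous function
$\mathrm{Tr}[\Sigma_{YY} - \Sigma_{YX}^B(\Sigma_{XX}^B + \varepsilon_n I)^{-1}\Sigma_{XY}^B]$
converges to the continuous function $\mathrm{Tr}[\Sigma_{YY|X}^B]$
for every $B \in \mathbb{S}_d^m(\mathbb{R})$. Because this convergence is
monotone and $\mathbb{S}_d^m(\mathbb{R})$ is compact, it is necessarily
uniform. □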



The proof of Proposition 7 is now easily obtained. 
Proof of Proposition 7. Lemmas 12 and 13 show the continuity of
$\mathrm{Tr}[\hat{\Sigma}_{YY|X}^{B(n)}]$ and $\mathrm{Tr}[\Sigma_{YY|X}^B]$.
Lemmas 10 and 14 prove the uniform convergence. □

5. Conclusions. This paper has presented KDR, a new method for sufficient
dimension reduction in regression. The method is based on a
characterization of conditional independence using covariance operators on
reproducing kernel Hilbert spaces. This characterization is not restricted to
first or
second-order conditional moments, but exploits high-order moments in the 
estimation of the central subspace. The KDR method is widely applicable; 
in distinction to most of the existing literature on SDR it does not impose 
strong assumptions on the probability distribution of the covariate vector 
X. It is also applicable to problems in which the response Y is discrete. 

We have developed some asymptotic theory for the estimator, resulting in 
a proof of consistency of the estimator under weak conditions. The proof of 
consistency reposes on a result establishing the uniform convergence of the 
empirical process in a Hilbert space. In particular, we have established the 
rate $O_p(n^{-1/2})$ for uniform convergence, paralleling the results for
ordinary real-valued empirical processes.

We have not yet developed distribution theory for the KDR method, and 
have left open the important problem of inferring the dimensionality of the 
central subspace. Our proof techniques do not straightforwardly extend to 
yield the asymptotic distribution of the KDR estimator, and new techniques 
may be required. 

It should be noted, however, that inference of the dimensionality of the 
central subspace is not necessary for many of the applications of SDR. In 
particular, SDR is often used in the context of graphical exploration of data, 
where a data analyst may wish to explore views of varying dimensionality. 
Also, in high-dimensional prediction problems of the kind studied in statisti- 
cal machine learning, dimension reduction may be carried out in the context 
of predictive modeling, in which case cross-validation and related techniques 
may be used to choose the dimensionality. 

Finally, while we have focused our discussion on the central subspace as 
the object of inference, it is also worth noting that KDR applies even to 
situations in which a central subspace does not exist. As we have shown, the 
KDR estimate converges to the subset of projection matrices that satisfy 
(1); this result holds regardless of the existence of a central subspace. That 
is, if the intersection of dimension-reduction subspaces is not a dimension- 
reduction subspace, but if the dimensionality chosen for KDR is chosen to 
be large enough such that subspaces satisfying (1) exist, then KDR will 
converge to one of those subspaces. 
APPENDIX: UNIFORM CONVERGENCE OF
CROSS-COVARIANCE OPERATORS

In this Appendix we present a proof of Lemma 9. The proof involves the
use of random elements in a Hilbert space [3, 30]. Let $\mathcal{H}$ be a
Hilbert space equipped with a Borel $\sigma$-field. A random element in the
Hilbert space $\mathcal{H}$ is a measurable map $F : \Omega \to \mathcal{H}$
from a measurable space $(\Omega, \mathfrak{S})$. If $\mathcal{H}$ is an
RKHS on a measurable set $\mathcal{X}$ with a measurable positive definite
kernel $k$, a random variable $X$ in $\mathcal{X}$ defines a random element in
$\mathcal{H}$ by $k(\cdot, X)$.

A random element $F$ in a Hilbert space $\mathcal{H}$ is said to have strong
order $p$ ($0 < p < \infty$) if $E\|F\|^p$ is finite. For a random element $F$
of strong order one, the expectation of $F$, which is defined as the element
$m_F \in \mathcal{H}$ such that
$\langle m_F, g \rangle_{\mathcal{H}} = E[\langle F, g \rangle_{\mathcal{H}}]$
for all $g \in \mathcal{H}$, is denoted by $E[F]$. With this notation, the
interchange of the expectation and the inner product is justified:
$\langle E[F], g \rangle_{\mathcal{H}} = E[\langle F, g \rangle_{\mathcal{H}}]$.
Note also that for independent random elements $F$ and $G$ of strong order two,
the relation

$$E[\langle F, G \rangle_{\mathcal{H}}] = \langle E[F], E[G] \rangle_{\mathcal{H}}$$

holds.

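Let $(X, Y)$ be a random vector on $\mathcal{X} \times \mathcal{Y}$ with law
$P_{XY}$, and let $\mathcal{H}_{\mathcal{X}}$ and $\mathcal{H}_{\mathcal{Y}}$ be
the RKHS with positive definite kernels $k_{\mathcal{X}}$ and $k_{\mathcal{Y}}$,
respectively, which satisfy (2). The random element $k_{\mathcal{X}}(\cdot, X)$
has strong order two, and $E[k_{\mathcal{X}}(\cdot, X)]$ equals $m_X$, where
$m_X$ is given by (4). The random element
$k_{\mathcal{X}}(\cdot, X)k_{\mathcal{Y}}(\cdot, Y)$ in the direct product
$\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{Y}}$ has strong order
one. Define the zero mean random elements
$F = k_{\mathcal{X}}(\cdot, X) - E[k_{\mathcal{X}}(\cdot, X)]$ and
$G = k_{\mathcal{Y}}(\cdot, Y) - E[k_{\mathcal{Y}}(\cdot, Y)]$.

For an i.i.d. sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ on
$\mathcal{X} \times \mathcal{Y}$ with law $P_{XY}$, define random elements
$F_i = k_{\mathcal{X}}(\cdot, X_i) - E[k_{\mathcal{X}}(\cdot, X)]$ and
$G_i = k_{\mathcal{Y}}(\cdot, Y_i) - E[k_{\mathcal{Y}}(\cdot, Y)]$.
Then, $F, F_1, \ldots, F_n$ and $G, G_1, \ldots, G_n$ are zero mean i.i.d.
random elements in $\mathcal{H}_{\mathcal{X}}$ and $\mathcal{H}_{\mathcal{Y}}$,
respectively. In the following, the notation
$\mathcal{F} = \mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{Y}}$ is
used for simplicity.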

As shown in the proof of Lemma 4 in [13], we have 



^YX 



HS 



which provides a bound 



sup WT.Yx - Syxllns ^ sup 



(21) 



1 " 

-Y.{fPg,-e[fg]) 

1=1 



+ sup 



n ^ ■' 
where $F_i^B$ are defined with the kernel $k^B$, and
$\mathcal{F}^B = \mathcal{H}_X^B \otimes \mathcal{H}_{\mathcal{Y}}$. Also, (20)
implies

2 



Tr[S 



1 



n 



E 



1 



n 



^IIFI 



-Ell^illw;, 



^IIFI 



2 = 1 



n 



j=i 



from which we have 

\B(n) 



(22) 



sup |Tr[Sj^^ < sup 



1 " 



-B II 2 771 1 1 r?B II 2 



+ sup 



E^^ 

1=1 



B 



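It follows that Lemma 9 is proved if all the four terms on the right-hand
side of (21) and (22) are of order $O_p(1/\sqrt{n})$.

Hereafter, the kernel $k_d$ is assumed to be bounded. We begin by considering
the first term on the right-hand side of (22). This is the supremum of
a process which consists of real-valued random variables
$\|F_i^B\|_{\mathcal{H}_X^B}^2$. Let $U^B$ be a random element in
$\mathcal{H}_d$ defined by

$$U^B = k_d(\cdot, B^TX) - E[k_d(\cdot, B^TX)],$$

and let $C > 0$ be a constant such that $|k_d(z, z)| \le C^2$ for all
$z \in \mathcal{Z}$. From $\|U^B\|_{\mathcal{H}_d} \le 2C$, we have for
$B, \tilde{B} \in \mathbb{S}_d^m(\mathbb{R})$

$$\bigl|\|F^B\|_{\mathcal{H}_X^B}^2 - \|F^{\tilde{B}}\|_{\mathcal{H}_X^{\tilde{B}}}^2\bigr| = \bigl|\langle U^B - U^{\tilde{B}}, U^B + U^{\tilde{B}} \rangle_{\mathcal{H}_d}\bigr| \le 4C\|U^B - U^{\tilde{B}}\|_{\mathcal{H}_d}.$$

The above inequality, combined with the bound

$$\|U^B - U^{\tilde{B}}\|_{\mathcal{H}_d} \le 2\phi(x) D(B, \tilde{B}) \tag{23}$$

obtained from assumption (A-3), provides a Lipschitz condition
$\bigl|\|F^B\|_{\mathcal{H}_X^B}^2 - \|F^{\tilde{B}}\|_{\mathcal{H}_X^{\tilde{B}}}^2\bigr| \le 8C\phi(x) D(B, \tilde{B})$,
which works as a sufficient condition for the uniform central limit theorem
[31], Example 19.7. This yields

$$\sup_{B \in \mathbb{S}_d^m(\mathbb{R})} \left|\frac{1}{n}\sum_{i=1}^{n} \|F_i^B\|_{\mathcal{H}_X^B}^2 - E\|F^B\|_{\mathcal{H}_X^B}^2\right| = O_p(1/\sqrt{n}).$$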
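Our approach to the other three terms is based on a treatment of empirical
processes in a Hilbert space. For $B \in \mathbb{S}_d^m(\mathbb{R})$, let
$U_i^B = k_d(\cdot, B^TX_i) - E[k_d(\cdot, B^TX)]$ be a random element in
$\mathcal{H}_d$. Then the relation
$\langle k^B(\cdot, x), k^B(\cdot, \tilde{x}) \rangle_{\mathcal{H}_X^B} = k_d(B^Tx, B^T\tilde{x}) = \langle k_d(\cdot, B^Tx), k_d(\cdot, B^T\tilde{x}) \rangle_{\mathcal{H}_d}$
implies

$$\left\|\frac{1}{n}\sum_{i=1}^{n} F_i^B\right\|_{\mathcal{H}_X^B} = \left\|\frac{1}{n}\sum_{i=1}^{n} U_i^B\right\|_{\mathcal{H}_d}, \tag{24}$$

$$\left\|\frac{1}{n}\sum_{i=1}^{n} F_i^B G_i - E[F^B G]\right\|_{\mathcal{H}_X^B \otimes \mathcal{H}_{\mathcal{Y}}} = \left\|\frac{1}{n}\sum_{i=1}^{n} U_i^B G_i - E[U^B G]\right\|_{\mathcal{H}_d \otimes \mathcal{H}_{\mathcal{Y}}}. \tag{25}$$

Note also that the assumption (A-3) gives

$$\|U^B G - U^{\tilde{B}} G\|_{\mathcal{H}_d \otimes \mathcal{H}_{\mathcal{Y}}} \le 2\sqrt{k_{\mathcal{Y}}(y, y)}\, \phi(x) D(B, \tilde{B}). \tag{26}$$

From (23)-(26), the proof of Lemma 9 is completed from the following
proposition: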


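Proposition 15. Let $(\mathcal{X}, \mathcal{B}_{\mathcal{X}})$ be a measurable
space, let $\Theta$ be a compact metric space with distance $D$, and let
$\mathcal{H}$ be a Hilbert space. Suppose that $X, X_1, \ldots, X_n$ are i.i.d.
random variables on $\mathcal{X}$, and suppose
$F : \mathcal{X} \times \Theta \to \mathcal{H}$ is a Borel measurable map. If
$\sup_{\theta \in \Theta} \|F(x; \theta)\|_{\mathcal{H}} < \infty$ for all
$x \in \mathcal{X}$ and there exists a measurable function
$\phi : \mathcal{X} \to \mathbb{R}$ such that $E[\phi(X)^2] < \infty$ and

$$\|F(x; \theta_1) - F(x; \theta_2)\|_{\mathcal{H}} \le \phi(x) D(\theta_1, \theta_2) \qquad (\forall \theta_1, \theta_2 \in \Theta), \tag{27}$$

then we have

$$\sup_{\theta \in \Theta} \left\|\frac{1}{\sqrt{n}}\sum_{i=1}^{n} \bigl(F(X_i; \theta) - E[F(X; \theta)]\bigr)\right\|_{\mathcal{H}} = O_p(1) \qquad (n \to \infty).$$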



The proof of Proposition 15 is similar to that for a real-valued random
process, and is divided into several lemmas.

I.i.d. random variables $\sigma_1, \ldots, \sigma_n$ taking values in
$\{+1, -1\}$ with equal probability are called Rademacher variables. The
following concentration inequality is known for a Rademacher average in a
Banach space:

Proposition 16. Let $a_1, \ldots, a_n$ be elements in a Banach space, and let
$\sigma_1, \ldots, \sigma_n$ be Rademacher variables. Then, for every $t > 0$,

$$\Pr\left(\left\|\sum_{i=1}^{n} \sigma_i a_i\right\| > t\right) \le 2\exp\left(-\frac{t^2}{32\sum_{i=1}^{n} \|a_i\|^2}\right).$$

Proof. See [21], Theorem 4.7 and the remark thereafter. □



With Proposition 16, the following exponential inequality is obtained with 
a slight modification of the standard symmetrization argument for empirical 
processes. 



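Lemma 17. Let $X, X_1, \ldots, X_n$ and $\mathcal{H}$ be as in Proposition 15,
and denote $(X_1, \ldots, X_n)$ by $\mathbf{X}_n$. Let
$F : \mathcal{X} \to \mathcal{H}$ be a Borel measurable map with
$E\|F(X)\|_{\mathcal{H}}^2 < \infty$. For a positive number $M$ such that
$E\|F(X)\|_{\mathcal{H}}^2 < M$, define an event $A_n$ by
$\frac{1}{n}\sum_{i=1}^{n} \|F(X_i)\|_{\mathcal{H}}^2 \le M$. Then, for every
$t > 0$ and sufficiently large $n$,

$$\Pr\left(\left\{\mathbf{X}_n \,\middle|\, \left\|\frac{1}{n}\sum_{i=1}^{n} \bigl(F(X_i) - E[F(X)]\bigr)\right\|_{\mathcal{H}} > t\right\} \cap A_n\right) \le 8\exp\left(-\frac{nt^2}{1024M}\right).$$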



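Proof. First, note that for any sufficiently large $n$ we have $\Pr(A_n) > \frac{3}{4}$
and $\Pr\bigl( \bigl\| \frac{1}{n}\sum_{i=1}^n (F(X_i) - E[F(X)]) \bigr\|_{\mathcal{H}} < \frac{t}{2} \bigr) > \frac{3}{4}$. We consider only
such $n$ in the following. Let $\tilde{\mathbf{X}}_n$ be an independent copy of $\mathbf{X}_n$, and let
$\tilde{A}_n = \{ \tilde{\mathbf{X}}_n \mid \frac{1}{n}\sum_{i=1}^n \|F(\tilde{X}_i)\|_{\mathcal{H}}^2 \le M \}$. The obvious inequality

$\Pr\Bigl( \Bigl\{ \mathbf{X}_n \Bigm| \Bigl\| \frac{1}{n}\sum_{i=1}^n (F(X_i) - E[F(X)]) \Bigr\|_{\mathcal{H}} > t \Bigr\} \cap A_n \Bigr) \times \Pr\Bigl( \Bigl\{ \tilde{\mathbf{X}}_n \Bigm| \Bigl\| \frac{1}{n}\sum_{i=1}^n (F(\tilde{X}_i) - E[F(X)]) \Bigr\|_{\mathcal{H}} < \frac{t}{2} \Bigr\} \cap \tilde{A}_n \Bigr)$
$\quad \le \Pr\Bigl( \Bigl\{ (\mathbf{X}_n, \tilde{\mathbf{X}}_n) \Bigm| \Bigl\| \frac{1}{n}\sum_{i=1}^n (F(X_i) - F(\tilde{X}_i)) \Bigr\|_{\mathcal{H}} > \frac{t}{2} \Bigr\} \cap A_n \cap \tilde{A}_n \Bigr)$

and the fact that $B_n := \{ (\mathbf{X}_n, \tilde{\mathbf{X}}_n) \mid \frac{1}{2n}\sum_{i=1}^n ( \|F(X_i)\|_{\mathcal{H}}^2 + \|F(\tilde{X}_i)\|_{\mathcal{H}}^2 ) \le M \}$
includes $A_n \cap \tilde{A}_n$ gives a symmetrized bound

$\Pr\Bigl( \Bigl\{ \mathbf{X}_n \Bigm| \Bigl\| \frac{1}{n}\sum_{i=1}^n (F(X_i) - E[F(X)]) \Bigr\|_{\mathcal{H}} > t \Bigr\} \cap A_n \Bigr) \le 2 \Pr\Bigl( \Bigl\{ (\mathbf{X}_n, \tilde{\mathbf{X}}_n) \Bigm| \Bigl\| \frac{1}{n}\sum_{i=1}^n (F(X_i) - F(\tilde{X}_i)) \Bigr\|_{\mathcal{H}} > \frac{t}{2} \Bigr\} \cap B_n \Bigr).$

Introducing Rademacher variables $\sigma_1, \dots, \sigma_n$, the right-hand side is equal
to

$2 \Pr\Bigl( \Bigl\{ (\mathbf{X}_n, \tilde{\mathbf{X}}_n, \{\sigma_i\}) \Bigm| \Bigl\| \frac{1}{n}\sum_{i=1}^n \sigma_i (F(X_i) - F(\tilde{X}_i)) \Bigr\|_{\mathcal{H}} > \frac{t}{2} \Bigr\} \cap B_n \Bigr),$

which is upper-bounded by

$4 \Pr\Bigl( \Bigl\| \frac{1}{n}\sum_{i=1}^n \sigma_i F(X_i) \Bigr\|_{\mathcal{H}} > \frac{t}{4} \ \text{and}\ \frac{1}{n}\sum_{i=1}^n \|F(X_i)\|_{\mathcal{H}}^2 \le 2M \Bigr)$
$\quad = 4 E_{\mathbf{X}_n}\Bigl[ \Pr\Bigl( \Bigl\| \frac{1}{n}\sum_{i=1}^n \sigma_i F(X_i) \Bigr\|_{\mathcal{H}} > \frac{t}{4} \Bigm| \mathbf{X}_n \Bigr) 1_{\{\mathbf{X}_n \in C_n\}} \Bigr],$

where $C_n = \{ \mathbf{X}_n \mid \frac{1}{n}\sum_{i=1}^n \|F(X_i)\|_{\mathcal{H}}^2 \le 2M \}$. From Proposition 16, the last
line is upper-bounded by $8 \exp\bigl( -\frac{n t^2}{1024 M} \bigr)$.

□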



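Let $\Theta$ be a set with semimetric $d$. For any $\delta > 0$, the covering number
$N(\delta, d, \Theta)$ is the smallest $m \in \mathbb{N}$ for which there exist $m$ points $\theta_1, \dots, \theta_m$ in
$\Theta$ such that $\min_{1 \le i \le m} d(\theta, \theta_i) \le \delta$ holds for any $\theta \in \Theta$. We write $N(\delta)$ for
$N(\delta, d, \Theta)$ if there is no confusion. For $\delta > 0$, the covering integral $J(\delta)$ for
$\Theta$ is defined by

$J(\delta) = \int_0^{\delta} \bigl( 8 \log( N(u)^2 / u ) \bigr)^{1/2} \, du.$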
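The two definitions above can be made concrete with a small Python sketch. The example set $\Theta = [0,1]$ with $d(s,t) = |s-t|$, the function names, and the midpoint-rule integration are all assumptions made for illustration. The integrand is taken to have the form $(c \log(N(u)^2/u))^{1/2}$, with the multiplicative constant exposed as the parameter `c`; its default value is itself an assumption of this sketch.

```python
import math

def covering_number_unit_interval(delta):
    """N(delta) for Theta = [0, 1] with d(s, t) = |s - t|.

    Each closed ball of radius delta covers length 2 * delta, so
    ceil(1 / (2 * delta)) centers are both necessary and sufficient.
    The small offset guards against floating-point round-up.
    """
    return max(1, math.ceil(1.0 / (2.0 * delta) - 1e-12))

def covering_integral(delta, c=8.0, steps=20_000):
    """Midpoint-rule approximation of int_0^delta sqrt(c * log(N(u)^2 / u)) du.

    The constant c is an assumption of this sketch; delta should be below 1
    so that the logarithm stays positive.
    """
    h = delta / steps
    total = 0.0
    for k in range(steps):
        u = (k + 0.5) * h
        n_u = covering_number_unit_interval(u)
        total += math.sqrt(c * math.log(n_u ** 2 / u)) * h
    return total

for delta in (0.05, 0.1, 0.5):
    print(delta, covering_number_unit_interval(delta), round(covering_integral(delta), 4))
```

For this one-dimensional example $N(u)$ grows like $1/u$, so the integrand has only a logarithmic singularity at zero and the covering integral is finite, which is the property the chaining argument needs.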

The chaining lemma [25], which plays a crucial role in the uniform central 
limit theorem, is readily extendable to a random process in a Banach space. 



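Lemma 18 (Chaining lemma). Let $\Theta$ be a set with semimetric $d$, and let
$\{ Z(\theta) \mid \theta \in \Theta \}$ be a family of random elements in a Banach space. Suppose
$\Theta$ has a finite covering integral $J(\delta)$ for $0 < \delta < 1$ and suppose there exists
a positive constant $R > 0$ such that for all $\theta, \eta \in \Theta$ and $t > 0$ the inequality

$\Pr\bigl( \|Z(\theta) - Z(\eta)\| > t\, d(\theta,\eta) \bigr) \le 8 \exp\Bigl( -\frac{t^2}{2R^2} \Bigr)$

holds. Then, there exists a countable subset $\Theta^*$ of $\Theta$ such that for any
$0 < \varepsilon < 1$

$\Pr\Bigl( \sup_{\theta, \eta \in \Theta^*,\ d(\theta,\eta) \le \varepsilon} \|Z(\theta) - Z(\eta)\| > 26 R J(d(\theta,\eta)) \Bigr) \le 2\varepsilon$

holds. If $Z(\theta)$ has continuous sample paths, then $\Theta^*$ can be replaced by $\Theta$.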

Proof. The proof of the chaining lemma for a real-valued random process uses
no special property of the real numbers beyond the properties of the norm
(absolute value) of $Z(\theta)$, so it applies directly to a process in a Banach
space. See [25], Section VII.2. □
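Proof of Proposition 15. Note that (27) means

$\Bigl\| \frac{1}{n}\sum_{i=1}^n \bigl( F(X_i;\theta_1) - F(X_i;\theta_2) \bigr) \Bigr\|_{\mathcal{H}}^2 \le D(\theta_1,\theta_2)^2 \, \frac{1}{n}\sum_{i=1}^n \phi(X_i)^2.$

Let $M > 0$ be a constant such that $E[\phi(X)^2] < M$, and let
$A_n = \{ \mathbf{X}_n \mid \| \frac{1}{n}\sum_{i=1}^n (F(X_i;\theta_1) - F(X_i;\theta_2)) \|_{\mathcal{H}}^2 \le M D(\theta_1,\theta_2)^2 \}$. Since the probability of
the complement of $A_n$ converges to zero as $n \to \infty$, it suffices to show that
there exists $\delta > 0$ such that the probability

$P_n = \Pr\Bigl( A_n \cap \Bigl\{ \mathbf{X}_n \Bigm| \sup_{\theta \in \Theta} \Bigl\| \frac{1}{\sqrt{n}}\sum_{i=1}^n \bigl( F(X_i;\theta) - E[F(X;\theta)] \bigr) \Bigr\|_{\mathcal{H}} > \delta \Bigr\} \Bigr)$

satisfies $\limsup_{n \to \infty} P_n = 0$.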

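With the notation $F_\theta(x) = F(x;\theta) - E[F(X;\theta)]$, from Lemma 17 we can
derive

$\Pr\Bigl( A_n \cap \Bigl\{ \mathbf{X}_n \Bigm| \Bigl\| \frac{1}{\sqrt{n}}\sum_{i=1}^n \bigl( F_{\theta_1}(X_i) - F_{\theta_2}(X_i) \bigr) \Bigr\|_{\mathcal{H}} > t \Bigr\} \Bigr) \le 8 \exp\Bigl( -\frac{t^2}{512 \cdot 2 M D(\theta_1,\theta_2)^2} \Bigr)$

for any $t > 0$ and sufficiently large $n$. Because the covering integral $J(\delta)$
with respect to $D$ is finite by the compactness of $\Theta$, and the sample path
$\Theta \ni \theta \mapsto \frac{1}{\sqrt{n}}\sum_{i=1}^n F_\theta(X_i) \in \mathcal{H}$ is continuous, the chaining lemma implies that
for any $0 < \varepsilon < 1$

$\Pr\Bigl( A_n \cap \Bigl\{ \mathbf{X}_n \Bigm| \sup_{\theta_1, \theta_2 \in \Theta,\ D(\theta_1,\theta_2) \le \varepsilon} \Bigl\| \frac{1}{\sqrt{n}}\sum_{i=1}^n \bigl( F_{\theta_1}(X_i) - F_{\theta_2}(X_i) \bigr) \Bigr\|_{\mathcal{H}} > 26 \cdot 512M \cdot J(\varepsilon) \Bigr\} \Bigr) \le 2\varepsilon.$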



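Take an arbitrary $\varepsilon \in (0,1)$. We can find a finite partition
$\Theta = \bigcup_{a=1}^{\nu(\varepsilon)} \Theta_a$ ($\nu(\varepsilon) \in \mathbb{N}$) so that any two points in each $\Theta_a$ are within the
distance $\varepsilon$. Let $\theta_a$ be an arbitrary point in $\Theta_a$. Then the probability $P_n$ is
bounded by

(28) $P_n \le \Pr\Bigl( \max_{1 \le a \le \nu(\varepsilon)} \Bigl\| \frac{1}{\sqrt{n}}\sum_{i=1}^n F_{\theta_a}(X_i) \Bigr\|_{\mathcal{H}} > \frac{\delta}{2} \Bigr) + \Pr\Bigl( A_n \cap \Bigl\{ \mathbf{X}_n \Bigm| \sup_{\theta, \eta \in \Theta,\ D(\theta,\eta) \le \varepsilon} \Bigl\| \frac{1}{\sqrt{n}}\sum_{i=1}^n \bigl( F_\theta(X_i) - F_\eta(X_i) \bigr) \Bigr\|_{\mathcal{H}} > \frac{\delta}{2} \Bigr\} \Bigr).$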
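From Chebyshev's inequality the first term is upper-bounded by

$\nu(\varepsilon) \Pr\Bigl( \Bigl\| \frac{1}{\sqrt{n}}\sum_{i=1}^n F_{\theta_a}(X_i) \Bigr\|_{\mathcal{H}} > \frac{\delta}{2} \Bigr) \le \frac{4 \nu(\varepsilon) E\|F_{\theta_a}(X)\|_{\mathcal{H}}^2}{\delta^2}.$

If we take sufficiently large $\delta$ so that $26 \cdot 512M \cdot J(\varepsilon) < \delta/2$ and
$4 \nu(\varepsilon) E\|F_{\theta_a}(X)\|_{\mathcal{H}}^2 < \varepsilon \delta^2$, the right-hand side of (28) is bounded by $3\varepsilon$,
which completes the proof.
□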



Acknowledgments. The authors thank the Editor and anonymous referees
for their helpful comments. The authors also thank Dr. Yoichi Nishiyama
for his helpful comments on the uniform convergence of empirical processes.



REFERENCES 

[1] Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc. 68 337-404. MR0051437

[2] Bach, F. R. and Jordan, M. I. (2002). Kernel independent component analysis. J. Mach. Learn. Res. 3 1-48. MR1966051

[3] Baker, C. R. (1973). Joint measures and cross-covariance operators. Trans. Amer. Math. Soc. 186 273-289. MR0336795

[4] Breiman, L. and Friedman, J. H. (1985). Estimating optimal transformations for multiple regression and correlation. J. Amer. Statist. Assoc. 80 580-598. MR0803258

[5] Chiaromonte, F. and Cook, R. D. (2002). Sufficient dimension reduction and graphics in regression. Ann. Inst. Statist. Math. 54 768-795. MR1954046

[6] Cook, R. D. (1998). Regression Graphics. Wiley, New York. MR1645673

[7] Cook, R. D. and Lee, H. (1999). Dimension reduction in regression with a binary response. J. Amer. Statist. Assoc. 94 1187-1200. MR1731482

[8] Cook, R. D. and Li, B. (2002). Dimension reduction for conditional mean in regression. Ann. Statist. 30 455-474. MR1902895

[9] Cook, R. D. and Weisberg, S. (1991). Discussion of Li. J. Amer. Statist. Assoc. 86 328-332.

[10] Cook, R. D. and Yin, X. (2001). Dimension reduction and visualization in discriminant analysis (with discussion). Aust. N. Z. J. Stat. 43 147-199. MR1839361

[11] Flury, B. and Riedwyl, H. (1988). Multivariate Statistics: A Practical Approach. Chapman and Hall, London.

[12] Friedman, J. H. and Stuetzle, W. (1981). Projection pursuit regression. J. Amer. Statist. Assoc. 76 817-823. MR0650892

[13] Fukumizu, K., Bach, F. R. and Gretton, A. (2007). Statistical consistency of kernel canonical correlation analysis. J. Mach. Learn. Res. 8 361-383. MR2320675

[14] Fukumizu, K., Bach, F. R. and Jordan, M. I. (2004). Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. J. Mach. Learn. Res. 5 73-99. MR2247974

[15] Fukumizu, K., Gretton, A., Sun, X. and Schölkopf, B. (2008). Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems 20 (J. Platt, D. Koller, Y. Singer and S. Roweis, eds.) 489-496. MIT Press, Cambridge, MA.






[16] Gretton, A., Bousquet, O., Smola, A. J. and Schölkopf, B. (2005). Measuring statistical dependence with Hilbert-Schmidt norms. In 16th International Conference on Algorithmic Learning Theory (S. Jain, H. U. Simon and E. Tomita, eds.) 63-77. Springer, Berlin. MR2255909

[17] Groetsch, C. W. (1984). The Theory of Tikhonov Regularization for Fredholm Equations of the First Kind. Pitman, Boston, MA.

[18] Hristache, M., Juditsky, A., Polzehl, J. and Spokoiny, V. (2001). Structure adaptive approach for dimension reduction. Ann. Statist. 29 1537-1566. MR1891738

[19] Kobayashi, S. and Nomizu, K. (1963). Foundations of Differential Geometry, Vol. 1. Wiley, New York.

[20] Lax, P. D. (2002). Functional Analysis. Wiley, New York. MR1892228

[21] Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces. Springer, Berlin. MR1102015

[22] Li, B., Zha, H. and Chiaromonte, F. (2005). Contour regression: A general approach to dimension reduction. Ann. Statist. 33 1580-1616. MR2166556

[23] Li, K.-C. (1991). Sliced inverse regression for dimension reduction (with discussion). J. Amer. Statist. Assoc. 86 316-342. MR1137117

[24] Li, K.-C. (1992). On principal Hessian directions for data visualization and dimension reduction: Another application of Stein's lemma. J. Amer. Statist. Assoc. 87 1025-1039. MR1209564

[25] Pollard, D. (1984). Convergence of Stochastic Processes. Springer, New York. MR0762984

[26] Reed, M. and Simon, B. (1980). Functional Analysis. Academic Press, New York. MR0751959

[27] Samarov, A. M. (1993). Exploring regression structure using nonparametric functional estimation. J. Amer. Statist. Assoc. 88 836-847. MR1242934

[28] Sriperumbudur, B., Gretton, A., Fukumizu, K., Lanckriet, G. and Schölkopf, B. (2008). Injective Hilbert space embeddings of probability measures. In Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008) (R. A. Servedio and T. Zhang, eds.) 111-122. Omnipress, Madison, WI.

[29] Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. Ser. B 58 267-288. MR1379242

[30] Vakhania, N. N., Tarieladze, V. I. and Chobanyan, S. A. (1987). Probability Distributions on Banach Spaces. Reidel, Dordrecht. MR1435288

[31] van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge Univ. Press, Cambridge. MR1652247

[32] Wahba, G. (1990). Spline Models for Observational Data. CBMS-NSF Regional Conference Series in Applied Mathematics 59. SIAM, Philadelphia, PA. MR1045442

[33] Xia, Y., Tong, H., Li, W. and Zhu, L.-X. (2002). An adaptive estimation of dimension reduction space. J. R. Stat. Soc. Ser. B Stat. Methodol. 64 363-410. MR1924297

[34] Yin, X. and Bura, E. (2006). Moment-based dimension reduction for multivariate response regression. J. Statist. Plann. Inference 136 3675-3688. MR2256281

[35] Yin, X. and Cook, R. D. (2005). Direction estimation in single-index regressions. Biometrika 92 371-384. MR2201365

[36] Zhu, Y. and Zeng, P. (2006). Fourier methods for estimating the central subspace and the central mean subspace in regression. J. Amer. Statist. Assoc. 101 1638-1651. MR2279485






K. Fukumizu
Institute of Statistical Mathematics
4-6-7 Minami-Azabu
Minato-ku, Tokyo 106-8569
Japan
E-mail: fukumizu@ism.ac.jp

F. R. Bach
INRIA - WILLOW Project-Team
Laboratoire d'Informatique
de l'École Normale Supérieure
CNRS/ENS/INRIA UMR 8548
45, rue d'Ulm 75230 Paris
France
E-mail: francis.bach@mines.org

M. I. Jordan
Department of Statistics
Department of Computer Science
and Electrical Engineering
University of California
Berkeley, California 94720
USA
E-mail: jordan@stat.berkeley.edu



